Abstract: In 1974 Kolmogorov proposed a non-probabilistic approach to
statistics and model selection. Let data be finite binary strings and
models be finite sets of binary strings. Consider model classes
consisting of models of given maximal (Kolmogorov) complexity. The
"structure function" of the given data expresses the relation between
the complexity level constraint on a model class and the least
log-cardinality of a model in the class containing the data. We show
that the structure function determines all stochastic properties of the
data: for every constrained model class it determines the individual
best-fitting model in the class irrespective of whether the "true" model
is in the model class considered or not. In this setting, this happens
with certainty, rather than with high probability as in the classical
case. We precisely quantify the goodness-of-fit of an individual model
with respect to individual data. We show that, within the obvious
constraints, every graph is realized by the structure function of some
data. We determine the (un)computability properties of the various
functions contemplated and of the "algorithmic minimal sufficient
statistic."

Index Terms: constrained minimum description length (MDL), constrained
maximum likelihood (ML), constrained best-fit model selection,
computability, lossy compression, minimal sufficient statistic,
non-probabilistic statistics, Kolmogorov complexity, Kolmogorov
structure function, prediction, sufficient statistic



I. Introduction 

As perhaps the last mathematical innovation of an extraordinary
scientific career, A.N. Kolmogorov [16], [17] proposed to found
statistical theory on finite combinatorial principles independent of
probabilistic assumptions. Technically, the new statistics is expressed
in terms of Kolmogorov complexity [15], the information in an individual
object. The relation between the individual data and its explanation
(model) is expressed by Kolmogorov's structure function. This function,
its variations and its relation
to model selection, have obtained some notoriety [22], HJ, 
E3, 0, d, E3, EHl, Cni, CSl, li, El, but it has not 
before been comprehensively analyzed and understood. It has often been
questioned why Kolmogorov chose to focus on the mysterious function
$h_x$ below, rather than on the more evident $\beta_x$ variant below.
The only written record by Kolmogorov himself is the following abstract
[16] (translated from the original Russian by L.A. Levin):

"To each constructive object corresponds a function $\Phi_x(k)$ of a
natural number $k$: the log of minimal cardinality of $x$-containing
sets that allow definitions of complexity at most $k$. If the element
$x$ itself allows a simple definition, then the function $\Phi$ drops to
1 even for small $k$. Lacking such definition, the element is "random"
in a negative sense. But it is positively "probabilistically random"
only when function $\Phi$ having taken the value $\Phi_0$ at a
relatively small $k = k_0$, then changes approximately as
$\Phi(k) = \Phi_0 - (k - k_0)$."

These pregnant lines will become clear on reading this paper, where we
use "$h_x$" for the structure function "$\Phi_x$." Our main result
establishes the importance of the structure function: for every data
item and every complexity level, minimizing a two-part code, one part
model description and one part data-to-model code (essentially a
constrained two-part MDL estimator [19]), over the class of models of at
most the given complexity, with certainty (and not only with high
probability) selects models that in a rigorous sense are the best
explanations among the contemplated models. The same holds for
minimizing the one-part code consisting of just the data-to-model code
(essentially a constrained maximum likelihood estimator). The
explanatory value of an individual model for particular data, its
goodness of fit, is quantified by the randomness deficiency (II.6)
expressed in terms of Kolmogorov complexity: minimal randomness
deficiency implies that the data is maximally "random" or "typical" for
the model. It turns out that the minimal randomness deficiency of the
data in a complexity-constrained model class cannot be computationally
monotonically approximated (in a sense made precise later in the paper)
up to any significant precision. Thus, while we can monotonically
approximate (in the precise sense of Section VIII) the minimal length
two-part code, or the one-part code, and thus monotonically approximate
implicitly the best fitting model, we cannot monotonically approximate
the number expressing the goodness of this fit. But this should be
sufficient: we want the best model rather than a number that measures
its goodness.

A. Randomness in the Real World 

Classical statistics investigates real-world phenomena using
probabilistic methods. There is the problem of what probability means,
whether it is subjective, objective, or exists at all. P.S. Laplace
conceived of the probability of a physical event as expressing lack of
knowledge concerning its true deterministic causes [12]. A. Einstein
rejected physical random variables as well: "I do not believe that the
good Lord plays dice." But even if true physical random variables do
exist, can we assume that a particular
phenomenon we want to explain is probabilistic? Suppos- 
ing that to be the case as well, we then use a probabilistic 
statistical method to select models. In this situation the 
proven "goodness" of such a method is so only in a proba- 
bilistic sense. But for current applications, the total prob- 
ability concentrated on potentially realizable data may be 
negligible, for example, in complex video and sound data. 
In such a case, a model selection process that is successful 
with high probability may nonetheless fail on the actually 
realized data. Avoiding these difficulties, Kolmogorov's 
proposal strives for the firmer and less contentious ground 
of finite combinatorics and effective computation. 

B. Statistics and Modeling 

Intuitively, a central task of statistics is to identify the 
true source that produced the data at hand. But suppose 
the true source is 100,000 fair coin flips and our data is the 
outcome 00 ... 0. A method that identifies flipping a fair 
coin as the cause of this outcome is surely a bad method, 
even though the source of the data it came up with happens 
to be the true cause. Thus, for a good statistical method 
to work well we assume that the data are "typical" for the 
source that produced the data, so that the source "fits" the 
data. The situation is more subtle for data like 0101...01.
Here the outcome of the source has an equal frequency of
0s and 1s, just like we would expect from a fair coin. But
again, it is virtually impossible that such data are produced
by a fair coin flip, or indeed, independent flips of a coin of
any particular bias. In real-world phenomena we cannot
be sure that the true source of the data is in the class of 
sources considered, or, worse, we are virtually certain that 
the true source is not in that class. Therefore, the real ques- 
tion is not to find the true cause of the data, but to model 
the data as well as possible. In recognition of this, we often 
talk about "models" instead of "sources," and the contem- 
plated "set of sources" is called the contemplated "model 
class." In traditional statistics "typicality" and "fitness"
are probabilistic notions tied to sets of data and models of 
large measure. In the Kolmogorov complexity setting we 
can express and quantify "typicality" of individual data 
with respect to a single model, and express and quantify 
the "fitness" of an individual model for the given data. 

II. Preliminaries 

Let $x, y, z \in \mathcal{N}$, where $\mathcal{N}$ denotes the natural
numbers and we identify $\mathcal{N}$ and $\{0,1\}^*$ according to the
correspondence

$$(0, \epsilon), (1, 0), (2, 1), (3, 00), (4, 01), \ldots$$

Here $\epsilon$ denotes the empty word. The length $|x|$ of $x$ is the
number of bits in the binary string $x$, not to be confused with the
cardinality $|S|$ of a finite set $S$. For example, $|010| = 3$ and
$|\epsilon| = 0$, while $|\{0,1\}^n| = 2^n$ and $|\emptyset| = 0$.
The emphasis is on binary sequences only for convenience; 
observations in any alphabet can be so encoded in a way 
that is 'theory neutral'. Below we will use the natural 
numbers and the binary strings interchangeably. 
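
The correspondence between natural numbers and binary strings used above can be sketched in a few lines of Python. This is only an illustration: the function names `nat_to_str` and `str_to_nat` are our own and do not appear in the paper.

```python
def nat_to_str(n: int) -> str:
    """Return the binary string with index n in the length-increasing
    lexicographic enumeration: 0 -> '', 1 -> '0', 2 -> '1', 3 -> '00', ..."""
    if n < 0:
        raise ValueError("index must be a natural number")
    # bin(n + 1) looks like '0b1...'; dropping '0b1' leaves the string.
    return bin(n + 1)[3:]


def str_to_nat(s: str) -> int:
    """Inverse of nat_to_str: prepend a 1, read as binary, subtract 1."""
    return int("1" + s, 2) - 1


# First five pairs of the correspondence.
for n in range(5):
    print(n, repr(nat_to_str(n)))
```

Under this identification the length of the string with index $n$ is $\lfloor \log_2 (n+1) \rfloor$.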

A. Self-delimiting Code

A binary string $y$ is a proper prefix of a binary string $x$ if we can
write $x = yz$ for $z \neq \epsilon$. A set
$\{x, y, \ldots\} \subseteq \{0,1\}^*$ is prefix-free if for any pair of
distinct elements in the set neither is a proper prefix of the other. A
prefix-free set is also called a prefix code and its elements are called
code words. An example of a prefix code, that is useful later, encodes
the source word $x = x_1 x_2 \ldots x_n$ by the code word

$$\bar{x} = 1^n 0 x.$$

This prefix-free code is called self-delimiting, because there is a
fixed computer program associated with this code that can determine
where the code word $\bar{x}$ ends by reading it from left to right
without backing up. This way a composite code message can be parsed in
its constituent code words in one pass, by the computer program. (This
desirable property holds for every prefix-free encoding of a finite set
of source words, but not for every prefix-free encoding of an infinite
set of source words. For a single finite computer program to be able to
parse a code message the encoding needs to have a certain uniformity
property like the $\bar{x}$ code.) Since we use the natural numbers and
the binary strings interchangeably, $|\bar{x}|$ where $x$ is ostensibly
an integer, means the length in bits of the self-delimiting code of the
binary string with index $x$. On the other hand, $\overline{|x|}$ where
$x$ is ostensibly a binary string, means the self-delimiting code of the
binary string with index the length $|x|$ of $x$. Using this code we
define the standard self-delimiting code for $x$ to be
$x' = \overline{|x|}\,x$. It is easy to check that $|\bar{x}| = 2n + 1$
and $|x'| = n + 2 \log n + 1$. Let $\langle \cdot \rangle$ denote a
standard invertible effective one-one encoding from
$\mathcal{N} \times \mathcal{N}$ to a subset of $\mathcal{N}$. For
example, we can set $\langle x, y \rangle = x'y$ or
$\langle x, y \rangle = \bar{x}y$. We can iterate this process to define
$\langle x, \langle y, z \rangle \rangle$, and so on.
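
The two self-delimiting codes and the pairing function can be made concrete with a short Python sketch. The helper names (`bar`, `prime`, `read_bar`, `read_prime`, `pair`, `unpair`) are our own choices for illustration; the decoder assumes a well-formed input stream and does no error handling.

```python
def nat_to_str(n: int) -> str:
    """Binary string with index n (0 -> '', 1 -> '0', 2 -> '1', ...)."""
    return bin(n + 1)[3:]


def str_to_nat(s: str) -> int:
    """Index of the binary string s."""
    return int("1" + s, 2) - 1


def bar(x: str) -> str:
    """Code word 1^n 0 x for a string x of length n; it has 2n + 1 bits."""
    return "1" * len(x) + "0" + x


def prime(x: str) -> str:
    """Standard self-delimiting code x': bar(string with index |x|), then x."""
    return bar(nat_to_str(len(x))) + x


def read_bar(stream: str, pos: int = 0):
    """Parse one bar-code word left to right; return (word, next position)."""
    n = 0
    while stream[pos + n] == "1":  # count the leading 1s
        n += 1
    start = pos + n + 1            # skip the separating 0
    return stream[start:start + n], start + n


def read_prime(stream: str, pos: int = 0):
    """Parse one x' code word; return (x, next position)."""
    length_str, pos = read_bar(stream, pos)
    n = str_to_nat(length_str)
    return stream[pos:pos + n], pos + n


def pair(x: str, y: str) -> str:
    """Pairing <x, y> = x'y."""
    return prime(x) + y


def unpair(code: str):
    """Invert pair: recover (x, y) from x'y."""
    x, pos = read_prime(code)
    return x, code[pos:]
```

With the string-index convention used here, the length field of `prime(x)` costs $2\lfloor \log_2(n+1) \rfloor + 1$ bits, which matches the $n + 2 \log n + 1$ stated in the text up to rounding. Because parsing never backs up, a concatenation such as `prime(a) + prime(b)` can be split again in one left-to-right pass.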

B. Kolmogorov Complexity 

For precise definitions, notation, and results see the text [14].
Informally, the Kolmogorov complexity, or algorithmic entropy, $K(x)$ of
a string $x$ is the length (number of bits) of a shortest binary program
(string) to compute $x$ on a fixed reference universal computer (such as
a particular universal Turing machine). Intuitively, $K(x)$ represents
the minimal amount of information required to generate $x$ by any
effective process. The conditional Kolmogorov complexity $K(x \mid y)$
of $x$ relative to $y$ is defined similarly as the length of a shortest
program to compute $x$, if $y$ is furnished as an auxiliary input to the
computation. For technical reasons we use a variant of complexity,
so-called prefix complexity, which is associated with Turing machines
for which the set of programs resulting in a halting computation is
prefix free. We realize prefix complexity by considering a special type
of Turing machine with a one-way input tape, a separate work tape, and a
one-way output tape. Such Turing machines are called prefix Turing
machines. If a machine $T$ halts with output $x$ after having scanned
all of $p$ on the input tape, but not further, then $T(p) = x$ and we
call $p$ a program for $T$. It is easy to see that
$\{p : T(p) = x, x \in \{0,1\}^*\}$ is a prefix code. Let
$T_1, T_2, \ldots$ be a standard enumeration of all prefix Turing
machines with a binary input tape, for example the lexicographical
length-increasing ordered syntactic prefix Turing machine descriptions
[14], and let $\phi_1, \phi_2, \ldots$ be the enumeration of
corresponding functions that are computed by the respective Turing
machines ($T_i$ computes $\phi_i$). These functions are the partial
recursive functions or computable functions (of effectively prefix-free
encoded arguments). The Kolmogorov complexity of $x$ is the length of
the shortest binary program from which $x$ is computed.

Definition II.1: The prefix Kolmogorov complexity of $x$ is

$$K(x) = \min_{p,i} \{ |\bar{i}| + |p| : T_i(p) = x \}, \tag{II.1}$$

where the minimum is taken over $p \in \{0,1\}^*$ and
$i \in \{1, 2, \ldots\}$. For the development of the theory we actually
require the Turing machines to use auxiliary (also called conditional)
information, by equipping the machine with a special read-only auxiliary
tape containing this information at the outset. Then, the conditional
version $K(x \mid y)$ of the prefix Kolmogorov complexity of $x$ given
$y$ (as auxiliary information) is defined similarly as before, and the
unconditional version is set to $K(x) = K(x \mid \epsilon)$.

One of the main achievements of the theory of computation is that the
enumeration $T_1, T_2, \ldots$ contains a machine, say $U = T_u$, that
is computationally universal in that it can simulate the computation of
every machine in the enumeration when provided with its index:
$U(\langle y, \bar{i}p \rangle) = T_i(\langle y, p \rangle)$ for all
$i, p, y$. We fix one such machine and designate it as the reference
universal prefix Turing machine. Using this universal machine it is easy
to show $K(x \mid y) = \min_q \{ |q| : U(\langle y, q \rangle) = x \}$.

A prominent property of the prefix-freeness of $K(x)$ is that we can
interpret $2^{-K(x)}$ as a probability distribution since $K(x)$ is the
length of a shortest prefix-free program for $x$. By the fundamental
Kraft's inequality, see for example [14], we know that if
$l_1, l_2, \ldots$ are the code-word lengths of a prefix code, then
$\sum_x 2^{-l_x} \le 1$. Hence,

$$\sum_x 2^{-K(x)} \le 1. \tag{II.2}$$
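
Kraft's inequality is easy to check numerically on a finite prefix code. The sketch below is illustrative only: it uses the $\bar{x} = 1^n 0 x$ code from Section II-A on all source words of length at most 3 (an arbitrary choice of ours), and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def bar(x: str) -> str:
    """Self-delimiting code word 1^n 0 x for a string x of length n."""
    return "1" * len(x) + "0" + x


def is_prefix_free(words) -> bool:
    """True if no word is a proper prefix of another (quadratic check)."""
    words = list(words)
    return not any(a != b and b.startswith(a) for a in words for b in words)


def kraft_sum(words) -> Fraction:
    """Exact value of the sum of 2^(-length) over the code words."""
    return sum((Fraction(1, 2 ** len(w)) for w in words), Fraction(0))


# All binary source words of length 0..3, encoded with the bar code.
source_words = ["".join(bits) for n in range(4) for bits in product("01", repeat=n)]
code_words = [bar(w) for w in source_words]

print(is_prefix_free(code_words), kraft_sum(code_words))
```

For this code the words of source length $n$ contribute $2^n \cdot 2^{-(2n+1)} = 2^{-(n+1)}$, so the sum stays below 1 however many lengths are included.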

This leads to the notion of universal distribution, a rigorous form of
Occam's razor, which implicitly plays an important part in the present
exposition. The functions $K(\cdot)$ and $K(\cdot \mid \cdot)$, though
defined in terms of a particular machine model, are machine-independent
up to an additive constant and acquire an asymptotically universal and
absolute character through Church's thesis, from the ability of
universal machines to simulate one another and execute any effective
process. The Kolmogorov complexity of an individual object was
introduced by Kolmogorov [15] as an absolute and objective
quantification of the amount of information in it. The information
theory of Shannon [21], on the other hand, deals with average
information to communicate objects produced by a random source. Since
the former theory is much more precise, it is surprising that analogs of
theorems in information theory hold for Kolmogorov complexity, be it in
somewhat weaker form. An example is the remarkable symmetry of
information property used later. Let $x^*$ denote the shortest
prefix-free program for a finite string $x$, or, if there is more than
one of these, then $x^*$ is the first one halting in a fixed standard
enumeration of all halting programs. Then, by definition,
$K(x) = |x^*|$. Denote $K(x, y) = K(\langle x, y \rangle)$. Then,

$$K(x, y) = K(x) + K(y \mid x^*) + O(1) = K(y) + K(x \mid y^*) + O(1). \tag{II.3}$$

Remark II.2: The information contained in $x^*$ in the conditional above
is the same as the information in the pair $(x, K(x))$, up to an
additive constant, since there are recursive functions $f$ and $g$ such
that for all $x$ we have $f(x^*) = (x, K(x))$ and $g(x, K(x)) = x^*$.
On input $x^*$, the function $f$ computes $x = U(x^*)$ and
$K(x) = |x^*|$; and on input $x, K(x)$ the function $g$ runs all
programs of length $K(x)$ simultaneously, round-robin fashion, until the
first program computing $x$ halts; this is by definition $x^*$. ◊

C. Precision 

It is customary in this area to use "additive constant $c$" or
equivalently "additive $O(1)$ term" to mean a constant, accounting for
the length of a fixed binary program, independent from every variable or
parameter in the expression in which it occurs. In this paper we use the
prefix complexity variant of Kolmogorov complexity for convenience.
Actually some results, especially Theorem ID. II are eas- 
ier to prove for plain complexity. Most results presented 
here are precise up to an additive term that is logarith- 
mic in the length of the binary string concerned, which 
means that they are valid for plain complexity as well — 
prefix complexity of a string exceeds the plain complexity 
of that string by at most an additive term that is logarith- 
mic in the length of that string. Thus, our use of prefix 
complexity is important for "fine details" only. 

D. Meaningful Information 

The information contained in an individual finite object 
(like a finite binary string) is measured by its Kolmogorov 
complexity — the length of the shortest binary program that 
computes the object. Such a shortest program contains no 
redundancy: every bit is information; but is it meaningful 
information? If we flip a fair coin to obtain a finite binary 
string, then with overwhelming probability that string con- 
stitutes its own shortest program. However, also with over- 
whelming probability all the bits in the string are meaning- 
less information, random noise. On the other hand, let an 
object x be a sequence of observations of heavenly bodies. 
Then x can be described by the binary string pd, where p 
is the description of the laws of gravity, and $d$ the observational
parameter setting: we can divide the information in $x$ into meaningful
information $p$ and accidental information $d$. The main task for
statistical inference and learning theory is to distil the meaningful
information present in the data. The question arises whether it is
possible to separate meaningful information from accidental information,
and if so, how. The essence of the solution to this problem is revealed
when we rewrite (II.1) as follows:

$$\begin{aligned}
K(x) &= \min_{p,i} \{ |\bar{i}| + |p| : T_i(p) = x \} \\
&= \min_{p,i} \{ 2|i| + |p| + 1 : T_i(p) = x \} \\
&\le \min_{q} \{ |q| : U(\langle \epsilon, q \rangle) = x \} + 2|u| + 1 \\
&\le \min_{r,j} \{ K(j) + |r| : U(\langle \epsilon, j^* \alpha r \rangle) = T_j(r) = x \} + 2|u| + 1 \\
&\le K(x) + O(1).
\end{aligned} \tag{II.4}$$

Here the minima are taken over $p, q, r \in \{0,1\}^*$ and
$i, j \in \{1, 2, \ldots\}$. The last equalities are obtained by using
the universality of the fixed reference universal prefix Turing machine
$U = T_u$ with $|u| = O(1)$. The string $j^*$ is a shortest
self-delimiting program of $K(j)$ bits from which $U$ can compute $j$,
and subsequent execution of the next self-delimiting fixed program
$\alpha$ will compute $\bar{j}$ from $j$. Altogether, this has the
effect that $U(\langle \epsilon, j^* \alpha r \rangle) = T_j(r)$.
This expression emphasizes the two-part code nature of 
Kolmogorov complexity. In the example 

$$x = 10101010101010101010101010$$

we can encode $x$ by a small Turing machine printing a specified number
of copies of the pattern "10" which computes $x$ from the program "13."
This way, $K(x)$ is viewed as the shortest length of a two-part code for
$x$, one part describing a Turing machine $T$, or model, for the regular
aspects of $x$, and the second part describing the irregular aspects of
$x$ in the form of a program to be interpreted by $T$. The regular, or
"valuable," information in $x$ is constituted by the bits in the "model"
while the random or "useless" information of $x$ constitutes the
remainder.
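
The two-part reading of this example can be mimicked with a toy sketch. Everything here is an assumption made for illustration: a Python closure stands in for the Turing machine $T$, and lengths are counted in characters, which is not Kolmogorov complexity.

```python
def repeat_machine(pattern: str):
    """Toy 'model': a machine that prints a given number of copies of pattern."""
    def T(program: str) -> str:
        # The program is the decimal count of copies to print.
        return pattern * int(program)
    return T


x = "10101010101010101010101010"

model_description = "10"   # part one: the regular aspect (the pattern)
program = "13"             # part two: what the model still needs to know

T = repeat_machine(model_description)
assert T(program) == x

# Crude size comparison in characters, only to convey the idea.
print(len(model_description) + len(program), "vs", len(x))
```

The split mirrors the text: the pattern carries the regularity of $x$, while the count is the residual input interpreted by the model.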

E. Data and Model 

To simplify matters, and because all discrete data can be binary coded,
we consider only finite binary data strings $x$. Our model class
consists of Turing machines $T$ that enumerate a finite set, say $S$,
such that on input $i \le |S|$ we have $T(i) = x$ with $x$ the $i$th
element of $T$'s enumeration of $S$, and $T(i)$ is a special undefined
value if $i > |S|$. The "best fitting" model for $x$ is a Turing machine
$T$ that reaches the minimum description length in (II.4). Such a
machine $T$ embodies the amount of useful information contained in $x$,
and we have divided a shortest program $x^*$ for $x$ into parts
$x^* = T^* i$ such that $T^*$ is a shortest self-delimiting program for
$T$. Now suppose we consider only low complexity finite-set
models, and under these constraints the shortest two-part 
description happens to be longer than the shortest one-part 



description. Does the model minimizing the two-part de- 
scription still capture all (or as much as possible) meaning- 
ful information? Such considerations require study of the 
relation between the complexity limit on the contemplated 
model classes, the shortest two-part code length, and the 
amount of meaningful information captured. 

F. Kolmogorov's Structure Functions 

We will prove that there is a close relation between functions
describing three, a priori seemingly unrelated, aspects of modeling
individual data by models of prescribed complexity: optimal fit, minimal
remaining randomness, and length of shortest two-part code, respectively
(see the figure). We first need a definition. Denote the complexity of
the finite set $S$ by $K(S)$: the length (number of bits) of the
shortest binary program $p$ from which the reference universal prefix
machine $U$ computes a listing of the elements of $S$ and then halts.
That is, if $S = \{x_1, \ldots, x_n\}$, then
$U(p) = \langle x_1, \langle x_2, \ldots, \langle x_{n-1}, x_n \rangle \ldots \rangle \rangle$.
The shortest program $p$, or, if there is more than one such shortest
program, then the first one that halts in a standard dovetailed running
of all programs, is denoted by $S^*$. The conditional complexity
$K(x \mid S)$ of $x$ given $S$ is the length (number of bits) of the
shortest binary program $p$ from which the reference universal prefix
machine $U$ computes $x$ from input $S$ given literally. In the sequel
we also use $K(x \mid S^*)$, defined as the length of the shortest
program that computes $x$ from input $S^*$. Just like in Remark II.2,
the input $S^*$ has more information, namely all information in the pair
$\langle S, K(S) \rangle$, than just the literal list $S$. Furthermore,
$K(S \mid x)$ is defined as the length of the shortest program that
computes $S$ from input $x$, and similarly we can define
$K(S^* \mid x)$, $K(S \mid x^*)$. For every finite set
$S \subseteq \{0,1\}^*$ containing $x$ we have

$$K(x \mid S) \le \log |S| + O(1). \tag{II.5}$$

Indeed, consider the self-delimiting code of $x$ consisting of its
$\lceil \log |S| \rceil$ bit long index in the lexicographical ordering
of $S$. This code is called data-to-model code. Its length quantifies
the maximal "typicality," or "randomness," data (possibly different from
$x$) can have with respect to this model. The lack of typicality of $x$
with respect to $S$ is measured by the amount by which $K(x \mid S)$
falls short of the length of the data-to-model code. The randomness
deficiency of $x$ in $S$ is defined by

$$\delta(x \mid S) = \log |S| - K(x \mid S), \tag{II.6}$$

for $x \in S$, and $\infty$ otherwise.
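
The data-to-model code behind (II.5) is computable and can be sketched directly; the randomness deficiency itself is not, since it involves $K(x \mid S)$. In the sketch below the function names and the tie-breaking order (length first, then lexicographic) are our own assumptions.

```python
def _ordered(S):
    """Fixed ordering of the model: by length, then lexicographically."""
    return sorted(S, key=lambda s: (len(s), s))


def index_width(S) -> int:
    """Number of bits ceil(log2 |S|) used for an index into S."""
    return (len(S) - 1).bit_length()


def data_to_model_code(x: str, S) -> str:
    """Fixed-width binary index of x in the ordering of S."""
    index = _ordered(S).index(x)          # raises ValueError if x not in S
    width = index_width(S)
    return format(index, "b").zfill(width) if width else ""


def decode(code: str, S) -> str:
    """Recover x from its index, given the model S."""
    return _ordered(S)[int(code, 2) if code else 0]


S = {"000", "011", "101", "110"}          # even-parity strings of length 3
code = data_to_model_code("101", S)
print(code, decode(code, S))
```

Since the decoder knows $S$, it knows the index width in advance, which is why the code is self-delimiting given $S$.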

"Best Fit" function: The minimal randomness deficiency function is

$$\beta_x(\alpha) = \min_S \{ \delta(x \mid S) : S \ni x, K(S) \le \alpha \}, \tag{II.7}$$

where we set $\min \emptyset = \infty$. The smaller $\delta(x \mid S)$
is, the more $x$ can be considered as a typical member of $S$. This
means that a set $S$ for which $x$ incurs minimal deficiency, in the
model class of contemplated sets of given maximal Kolmogorov complexity,
is a "best fitting" model for $x$ in that model class (a most likely
explanation), and $\beta_x(\alpha)$ can be viewed as a constrained best
fit estimator. If the randomness deficiency is close to 0, then there
are no simple special properties that single $x$ out from the majority
of elements in $S$. This is not just terminology: if
$\delta(x \mid S)$ is small enough, then $x$ satisfies all properties of
low Kolmogorov complexity that hold with high probability for the
elements of $S$. To be precise: consider strings of length $n$ and let
$S$ be a subset of such strings. A property $P$ represented by $S$ is a
subset of $S$, and we say that $x$ satisfies property $P$ if
$x \in P$. Often, the cardinality of a family of sets $\{S\}$ we
consider depends on the length $n$ of the strings in $S$. We discuss
properties in terms of bounds $\delta(n) \le \log |S|$. (The lemma below
can also be formulated in terms of probabilities instead of frequencies
if we are talking about a probabilistic ensemble $S$.)

Lemma II.3: Let $S \subseteq \{0,1\}^n$.

(i) If $P$ is a property satisfied by all $x \in S$ with
$\delta(x \mid S) \le \delta(n)$, then $P$ holds for a fraction of at
least $1 - 1/2^{\delta(n)}$ of the elements in $S$.

(ii) Let $n$ and $S$ be fixed, and let $P$ be any property that holds
for a fraction of at least $1 - 1/2^{\delta(n)}$ of the elements of $S$.
There is a constant $c$, such that every such $P$ holds simultaneously
for every $x \in S$ with
$\delta(x \mid S) \le \delta(n) - K(P \mid S) - c$.

Proof: (i) There are only $\sum_{i=0}^{\log |S| - \delta(n)} 2^i$
programs of length not greater than $\log |S| - \delta(n)$ and there are
$|S|$ elements in $S$.

(ii) Suppose $P$ does not hold for an object $x \in S$ and the
randomness deficiency satisfies
$\delta(x \mid S) \le \delta(n) - K(P \mid S) - c$. Then we can
reconstruct $x$ from a description of $P$, which can use $S$, and $x$'s
index $j$ in an effective enumeration of all objects for which $P$
doesn't hold. There are at most $|S|/2^{\delta(n)}$ such objects by
assumption, and therefore there are constants $c_1, c_2$ such that

$$K(x \mid S) \le \log j + c_1 \le \log |S| - \delta(n) + c_2.$$

Hence, by the assumption on the randomness deficiency of $x$, we find
$K(P \mid S) \le c_2 - c$, which contradicts the necessary nonnegativity
of $K(P \mid S)$ if we choose $c > c_2$. ■
Example II.4: Lossy Compression. The function $\beta_x(\alpha)$ is
relevant to lossy compression (used, for instance, to compress images).
Assume we need to compress $x$ to $\alpha$ bits where
$\alpha \ll K(x)$. Of course this implies some loss of information
present in $x$. One way to select redundant information to discard is as
follows: find a set $S \ni x$ with $K(S) \le \alpha$ and with small
$\delta(x \mid S)$, and consider a compressed version $S'$ of $S$. To
reconstruct an $x'$, a decompresser uncompresses $S'$ to $S$ and selects
at random an element $x'$ of $S$. Since with high probability the
randomness deficiency of $x'$ in $S$ is small, $x'$ serves the purpose
of the message $x$ as well as does $x$ itself. Let us look at an
example. To transmit a picture of "rain" through a channel with limited
capacity $\alpha$, one can transmit the indication that this is a
picture of the rain and the particular drops may be chosen by the
receiver at random. In this interpretation, $\beta_x(\alpha)$ indicates
how "random" or "typical" $x$ is with respect to the best model at
complexity level $\alpha$, and hence how "indistinguishable" from the
original $x$ the randomly reconstructed $x'$ can be expected to be. The
relation of the structure function to lossy compression and
rate-distortion theory is the subject of an upcoming paper by the
authors. ◊

"Structure" function: The original Kolmogorov structure function [17],
[15] for data $x$ is defined as

$$h_x(\alpha) = \min_S \{ \log |S| : S \ni x, K(S) \le \alpha \}, \tag{II.8}$$

where $S \ni x$ is a contemplated model for $x$, and $\alpha$ is a
nonnegative integer value bounding the complexity of the contemplated
$S$'s. Clearly, this function is non-increasing and reaches
$\log |\{x\}| = 0$ for $\alpha = K(x) + c_1$ where $c_1$ is the number
of bits required to change $x$ into $\{x\}$. The function can also be
viewed as a constrained maximum likelihood (ML) estimator, a viewpoint
that is more evident for its version for probability models (see the
figure in the appendix).
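
Returning briefly to the lossy compression scheme of Example II.4, the following toy sketch may help fix ideas. It is only an illustration under strong assumptions of our own: the model class is restricted to sets of all $n$-bit strings with exactly $k$ ones, the pair $(n, k)$ plays the role of the compressed model $S'$, and no attempt is made to minimize the randomness deficiency.

```python
import random


def compress(x: str):
    """Keep only the model: length of x and its number of ones."""
    return (len(x), x.count("1"))


def decompress(model, rng: random.Random) -> str:
    """Pick a uniformly random element of the set described by the model."""
    n, k = model
    ones = set(rng.sample(range(n), k))   # positions of the k ones
    return "".join("1" if i in ones else "0" for i in range(n))


x = "0010110100"
model = compress(x)
x_prime = decompress(model, random.Random(0))

# x_prime is generally different from x but lies in the same model set.
print(model, x_prime)
```

The receiver reproduces the regularity captured by the model (here, the density of ones) and fills in the rest at random, in the spirit of the "rain" picture.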
For every $S \ni x$ we have

$$K(x) \le K(S) + \log |S| + O(1). \tag{II.9}$$

Indeed, consider the following two-part code for $x$: the first part is
a shortest self-delimiting program $p$ of $S$ and the second part is the
$\lceil \log |S| \rceil$ bit long index of $x$ in the lexicographical
ordering of $S$. Since $S$ determines $\log |S|$ this code is
self-delimiting and we obtain (II.9) where the constant $O(1)$ is the
length of the program to reconstruct $x$ from its two-part code. We thus
conclude that $K(x) \le \alpha + h_x(\alpha) + O(1)$, that is, the
function $h_x(\alpha)$ never decreases more than a fixed independent
constant below the diagonal sufficiency line $L$ defined by
$L(\alpha) + \alpha = K(x)$, which is a lower bound on $h_x(\alpha)$ and
is approached to within a constant distance by the graph of $h_x$ for
certain $\alpha$'s (for instance, for $\alpha = K(x) + c_1$). For these
$\alpha$'s we have $\alpha + h_x(\alpha) = K(x) + O(1)$ and the
associated model (witness for $h_x(\alpha)$) is called an optimal set
for $x$, and its description of $\le \alpha$ bits is called a sufficient
statistic. If no confusion can result we use these names
interchangeably. The main properties of a sufficient statistic are the
following: if $S$ is a sufficient statistic for $x$, then
$K(S) + \log |S| = K(x) + O(1)$. That is, the two-part description of
$x$ using the model $S$ and as data-to-model code the index of $x$ in
the enumeration of $S$ in $\log |S|$ bits, is as concise as the shortest
one-part code of $x$ in $K(x)$ bits. Since now
$K(x) \le K(x, S) + O(1) \le K(S) + K(x \mid S) + O(1) \le K(S) + \log |S| + O(1) \le K(x) + O(1)$,
using straightforward inequalities (for example, given $S \ni x$, we can
describe $x$ self-delimitingly in $\log |S| + O(1)$ bits) and the
sufficiency property, we find that $K(x \mid S) = \log |S| + O(1)$.
Therefore, the randomness deficiency of $x$ in $S$ is constant, $x$ is a
typical element for $S$, and $S$ is a model of best fit for $x$. The
data item $x$ can have randomness deficiency about 0, and hence be a
typical element for models $S$ that are not sufficient statistics. A
sufficient statistic $S$ for $x$ has the additional property, apart from
being a model of best fit, that $K(x, S) = K(x) + O(1)$ and therefore by
(II.3) we have $K(S \mid x^*) = O(1)$: the sufficient statistic $S$ is a
model of best fit that is almost completely determined by $x$. The
sufficient statistic associated with the least such $\alpha$ is called
the minimal sufficient statistic. For more details see 
[ini and Section El 

"Minimal Description Length" function: The length of the minimal
two-part code for $x$, consisting of the model cost $K(S)$ and the
length of the index of $x$ in $S$, in the model class of sets of given
maximal Kolmogorov complexity $\alpha$, is given by the MDL function or
constrained MDL estimator:

$$\lambda_x(\alpha) = \min_S \{ \Lambda(S) : S \ni x, K(S) \le \alpha \}, \tag{II.10}$$

where $\Lambda(S) = \log |S| + K(S) \ge K(x) - O(1)$ is the total length
of the two-part code of $x$ with help of model $S$. Apart from being
convenient for the technical analysis in this work, $\lambda_x(\alpha)$
is the celebrated two-part Minimum Description Length code length
(Section V-B) as a function of $\alpha$, with the model class restricted
to models of code length at most $\alpha$.
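
The shape of definitions (II.8) and (II.10) can be illustrated on a small explicit family of candidate models. This is a toy sketch, not a way to compute the real functions: $K(S)$ is uncomputable, so each model below carries a made-up integer cost standing in for $K(S)$, and the candidate family itself is an arbitrary choice of ours.

```python
from itertools import product
from math import inf, log2


def h(x, models, alpha):
    """Toy structure function: least log-cardinality of a model containing x
    whose (assumed) description cost is at most alpha."""
    return min((log2(len(S)) for cost, S in models if x in S and cost <= alpha),
               default=inf)


def lam(x, models, alpha):
    """Toy MDL function: least two-part code length cost + log|S|."""
    return min((cost + log2(len(S)) for cost, S in models if x in S and cost <= alpha),
               default=inf)


x = "0110"
all_strings = frozenset("".join(b) for b in product("01", repeat=4))
two_ones = frozenset(s for s in all_strings if s.count("1") == 2)

# (assumed cost, model) pairs; the costs are illustrative placeholders.
models = [(1, all_strings), (3, two_ones), (6, frozenset([x]))]

for alpha in range(8):
    print(alpha, h(x, models, alpha), lam(x, models, alpha))
```

Even in this toy setting both functions are non-increasing in $\alpha$, and the empty minimum is $\infty$ when no admissible model exists, as in the convention $\min \emptyset = \infty$.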

III. Overview of Results 

A. Background and Related Work 

There is no written version, apart from the few lines which we
reproduced in Section I, of A.N. Kolmogorov's initial proposal [16],
[17] for a non-probabilistic approach to Statistics and Model Selection.
We thus have to rely on
oral history, see Appendix El There, we also describe an 
early independent related result of L.A. Levin. Related work on
so-called "non-stochastic objects" (where $h_x(\alpha) + \alpha$ drops
to $K(x)$ only for large $\alpha$) is [23,
1231, El, E3- In 1987, EZl, EHI, v. v. Vyugin established 
that, for $\alpha = o(|x|)$, the randomness deficiency function
$\beta_x(\alpha)$ can assume all possible shapes (within the obvious
constraints). In the survey [5] of Kolmogorov's work in information
theory, the authors preferred to mention $\beta_x(\alpha)$, because it
by definition optimizes "best fit," rather than $h_x(\alpha)$, of which
the usefulness and meaningfulness was mysterious. But Kolmogorov had a
seldom erring intuition: we will show that his original proposal $h_x$
in the proper sense incorporates all desirable properties of
$\beta_x(\alpha)$, and in fact is
superior. In [3], [H], |S] a notion of "algorithmic sufficient 
statistics" , derived from Kolmogorov's structure function, 
is suggested as the algorithmic approach to the probabilis- 
tic notion of sufficient statistic [7] , |S| that is central in clas- 
sical statistics. The paper JH] investigates the algorithmic 
notion in detail and formally establishes such a relation. 
The algorithmic (minimal) sufficient statistic is related in 
|24|. [11. to the "minimum description length" principle 
[19], [2], [30] in statistics and inductive reasoning. Moreover, it was observed that $\beta_x(\alpha) \le h_x(\alpha) + \alpha - K(x) + O(1)$, establishing a one-sided relation between (II.7) and (II.8), and the question was raised whether the converse holds.

B. This Work 

When we compare statistical hypotheses $S_0$ and $S_1$ to explain data $x$ of length $n$, we should take into account three parameters: $K(S)$, $K(x \mid S)$, and $\log|S|$. The first parameter is the simplicity of the theory $S$ explaining the data. The difference $\delta(x \mid S) = \log|S| - K(x \mid S)$ (the randomness deficiency) shows how typical the data is with respect to $S$. The sum $\Lambda(S) = K(S) + \log|S|$ tells us how short the two-part code of the data using theory $S$ is, consisting of the code for $S$ and a code for $x$ simply using the worst-case number of bits possibly required to identify $x$ in the enumeration of $S$. This second part consists of the full-length index, ignoring savings in code length using possible non-typicality of $x$ in $S$ (like being the first element in the enumeration of $S$). We would like to define that $S_0$ is not worse than $S_1$ (as an explanation for $x$), in symbols: $S_0 \le S_1$, if
• $K(S_0) \le K(S_1)$;
• $\delta(x \mid S_0) \le \delta(x \mid S_1)$; and
• $\Lambda(S_0) \le \Lambda(S_1)$.

To be sure, this is not equivalent to saying that $K(S_0) \le K(S_1)$, $\delta(x \mid S_0) \le \delta(x \mid S_1)$, $\log|S_0| \le \log|S_1|$. (The latter relation is stronger in that it implies $S_0 \le S_1$ but not vice versa.) The algorithmic statistical properties of a data string $x$ are fully represented by the set $A_x$ of all triples $(K(S), \delta(x \mid S), \Lambda(S))$ such that $S \ni x$, together with a componentwise order relation $\le$ on those triples. The complete characterization of what this set may look like (with $O(\log n)$-accuracy) is now known in the following sense.
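The componentwise order and the notion of a minimal (non-improvable) triple can be sketched as follows. The triples used here are hypothetical numbers chosen for illustration only, since the true values of $K(S)$ and $\delta(x \mid S)$ are not computable; the function names are likewise assumptions of this sketch.

```python
def not_worse(t0, t1):
    """Componentwise order on triples (K(S), delta(x|S), Lambda(S))."""
    return all(a <= b for a, b in zip(t0, t1))


def minimal_triples(triples):
    """Return the triples that no other distinct triple dominates componentwise."""
    unique = set(triples)
    return sorted(
        t for t in unique
        if not any(not_worse(u, t) and u != t for u in unique)
    )


# Hypothetical triples for three hypotheses explaining the same data.
EXAMPLE = [(2, 6, 12), (5, 1, 10), (7, 1, 11)]
```

Here `(7, 1, 11)` is dominated by `(5, 1, 10)`, so `minimal_triples(EXAMPLE)` keeps only the other two triples. Neither of those is comparable to the other: one hypothesis is simpler, the other fits better and gives a shorter two-part code.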

Our results (Theorems IV.4, IV.8, IV.11) describe completely (with $O(\log n)$-accuracy) possible shapes of the closely related set $B_x$ consisting of all triples $(\alpha, \beta, \lambda)$ such that there is a set $S \ni x$ with $K(S) \le \alpha$, $\delta(x \mid S) \le \beta$, $\Lambda(S) \le \lambda$. That is, $A_x \subseteq B_x$ and $A_x$ and $B_x$ have the same minimal triples. Hence, we can informally say that our results describe completely possible shapes of the set of triples $(K(S), \delta(x \mid S), \Lambda(S))$ for non-improvable hypotheses $S$ explaining $x$. For example, up to $O(\log n)$ accuracy, and denoting $k = K(x)$ and $n = |x|$:

(i) For every minimal triple $(\alpha, \beta, \gamma)$ in $B_x$ we have $0 \le \alpha \le k$, $0 \le \beta$, $\beta + k = \gamma \le n$.

(ii) There is a triple of the form $(\alpha_0, 0, k)$ in $B_x$ (the minimal such $\alpha_0$ is the complexity of the minimal sufficient statistic for $x$). This property allows us to recover the complexity $k$ of $x$ from $B_x$.

(iii) There is a triple of the form $(0, \lambda_0 - k, \lambda_0)$ in $B_x$ with $\lambda_0 \le n$.

Previously, a limited characterization was obtained by V'yugin [27], [28] for the possible shapes of the projection of $B_x$ on $\alpha, \beta$-coordinates but only for the case when $\alpha = o(K(x))$. Our results describe possible shapes of the entire set $B_x$ for the full domain of $\alpha$ (with $O(\log n)$-accuracy). Namely, let $f$ be a non-increasing integer valued function such that $f(0) \le n$, $f(i) = k$ for all $i \ge k$ and

$B_f = \{(\alpha, \beta, \lambda) \mid 0 \le \alpha,\ f(\alpha) \le \lambda,\ f(\alpha) - k \le \beta\}$.

For every $x$ of length $n$ and complexity $k$ there is $f$ such that

$B_f + u \subseteq B_x \subseteq B_f - u$ (III.1)

where $u = (c \log n, c \log n, c \log n)$ for some universal constant $c$. Conversely, for every $k \le n$ and every such $f$ there is $x$ of length $n$ such that (III.1) holds for $u = (c \log K(n,f,k), c \log K(n,f,k), c \log K(n,f,k))$. Our results imply that the set $B_x$ is not computable given $x, k$ but is computable given $x, k$ and $\alpha_0$, the complexity of the minimal sufficient statistic.

Remark III.1: There is also the fourth important parameter, $K(S \mid x^*)$, reflecting the determinacy of model $S$ by the data $x$. However, the equality $\log|S| + K(S) - K(x) = K(S \mid x^*) + \delta(x \mid S) + O(1)$ shows that this parameter can be expressed in $\alpha, \beta, \lambda$. The main result (III.2) establishes that $K(S \mid x^*)$ is logarithmic for every set $S$ witnessing $h_x(\alpha)$. This also shows that there are at most polynomially many such sets. ◇

C. Technical Details 

The results are obtained by analysis of the relations between the structure functions. The most fundamental result in this paper is the equality

$\beta_x(\alpha) = h_x(\alpha) + \alpha - K(x) = \lambda_x(\alpha) - K(x)$ (III.2)

which holds within additive terms, that are logarithmic in the length of the string $x$, in argument and value. Every set $S$ that witnesses the value $h_x(\alpha)$ (or $\lambda_x(\alpha)$) also witnesses the value $\beta_x(\alpha)$ (but not vice versa). It is easy to see that $h_x(\alpha)$ and $\lambda_x(\alpha)$ are upper semi-computable (Definition VII.1); but we show that $\beta_x(\alpha)$ is neither upper nor lower semi-computable. A priori there is no reason to suppose that a set that witnesses $h_x(\alpha)$ (or $\lambda_x(\alpha)$) also witnesses $\beta_x(\alpha)$, for every $\alpha$. But the fact that they do vindicates Kolmogorov's original proposal and establishes $h_x$'s pre-eminence over $\beta_x$. The result can be taken as a foundation and justification of common statistical principles in model selection such as maximum likelihood or MDL ([19], [2] and our Sections V-B and V-C). We have also addressed the fine structure of the shape of $h_x$ (especially for $\alpha$ below the minimal sufficient statistic complexity) and a uniform (noncomputable) construction for the structure functions.

The possible (coarse) shapes of the functions $\lambda_x$, $h_x$ and $\beta_x$ are examined in Section IV. Roughly stated: The structure functions $\lambda_x$, $h_x$ and $\beta_x$ can assume all possible shapes over their full domain of definition (up to additive logarithmic precision in both argument and value). As a consequence, so-called "non-stochastic" strings $x$ for which $h_x(\alpha) + \alpha$ stabilizes on $K(x)$ for large $\alpha$ are common. This improves and extends V'yugin's result [27], [28] above; it
also improves the independent related result of L.A. Levin 
1131 in Appendix 1X1 and, applied to "snooping curves" ex- 
tends a recent result of V'yugin, [221, in Section IV-AI The 
fact that $\lambda_x$ can assume all possible shapes over its full domain of definition establishes the significance of (III.2), since it shows that $\lambda_x(\alpha) \gg K(x)$ indeed happens for some $x, \alpha$ pairs. In that case the more or less easy fact that $\beta_x(\alpha) = 0$ for $\lambda_x(\alpha) = K(x)$ is not applicable, and a priori there is no reason for (III.2): Why should minimizing a set containing $x$ plus the set's description length also minimize $x$'s randomness deficiency in the set? But (III.2) shows that it does! We determined the (fine) details of the function shapes in Section VI. (Non-)computability properties are examined in Section VII, incidentally proving a to our knowledge first natural example, $\beta_x$, of a function that is not semi-computable but computable with an oracle for the halting problem. In Section VIII we exhibit a uniform construction for sets realizing $h_x(\alpha)$ for all $\alpha$.

D. Probability Models 

Following Kolmogorov we analyzed a canonical setting 
where the models are finite sets. As Kolmogorov himself 
pointed out, this is no real restriction: the finite sets model 
class is equivalent, up to a logarithmic additive term, to the 
model class of probability density functions, as studied in 
|22| . |lUj . and the model class of total recursive functions, 
as studied in [21], see Appendix IbI 

E. All Stochastic Properties of the Data 

The result (III.2) shows that the function $h_x(\alpha)$ yields all stochastic properties of data $x$ in the following sense: for every $\alpha$ the class of models of maximal complexity $\alpha$ has a best model with goodness-of-fit determined by the randomness deficiency $\beta_x(\alpha) = h_x(\alpha) + \alpha - K(x)$, the equality being taken up to logarithmic precision. For example, for some value $\alpha_0$ the minimal randomness deficiency $\beta_x(\alpha)$ may be quite large for $\alpha < \alpha_0$ (so the best model in that class has poor fit), but an infinitesimal increase in model complexity may cause $\beta_x(\alpha)$ to drop to zero (and hence the marginally increased model class now has a model of
perfect fit), see Figure ^ Indeed, the structure function 
quantifies the best possible fit for a model in classes of ev- 
ery complexity. 
F. Used Mathematics

Kolmogorov's proposal for a nonprobabilistic statistics is combinatorial and algorithmic, rather than probabilistic. Similar to other recent directions in information theory and statistics, this involves notions and proof techniques from computer science theory, rather than from probability theory. But the subject matter and results are about traditional statistics and information theory notions like model selection, information and compression; consequently the treatment straddles fields that are not traditionally intertwined. For the convenience of the reader who is unfamiliar with algorithmic notions and methods we have taken pains to provide intuitive explanations and interpretations. Moreover, we have delegated almost all proofs to
Appendix[ni and all precise formulations and proofs of the 
(non)computability and (non)approximability of the struc- 
ture functions to Appendix IdI 

IV. Coarse Structure 

In classical statistics, unconstrained maximum likelihood is known to perform badly for model selection, because it tends to want the most complex models possible. A precise quantification and explanation of this phenomenon, in the complexity-constrained model class setting, is given in this section. It is easy to see that unconstrained maximization will result in the singleton set model $\{x\}$ of complexity about $K(x)$. We will show that the structure function $h_x(\alpha)$ tells us all stochastic properties of data $x$. From complexity 0 up to the complexity where the graph hits the sufficiency line, the best fitting models do not represent all meaningful properties of $x$. The distance between $h_x(\alpha)$ and the sufficiency line $L(\alpha) = K(x) - \alpha$ is a measure, expressed by $\beta_x(\alpha)$, of how far the best fitting model at complexity $\alpha$ falls short of a sufficient fitting model. The least complex sufficient fitting model, the minimal sufficient statistic, occurs at complexity level $\alpha_0$ where $h_x$ hits the sufficiency line. There, $h_x(\alpha_0) + \alpha_0 = K(x)$. The minimal sufficient statistic model expresses all meaningful information in $x$, and its complexity is the number of bits of meaningful information in the data $x$. The remainder $h_x(\alpha_0)$ bits of the $K(x)$ bits of information in data $x$ is the "noise," the meaningless randomness, contained in the data. When we consider the function $h_x$ at still higher complexity levels $\alpha > \alpha_0$, the function $h_x(\alpha)$ hugs the sufficiency line $L(\alpha) = K(x) - \alpha$, which means that $h_x(\alpha) + \alpha$ stays constant at $K(x)$. The best fitting models at these complexities start to model more and more noise, $h_x(\alpha_0) - h_x(\alpha) = \alpha - \alpha_0$ bits, in the data $x$: the added complexity $\alpha - \alpha_0$ in the sufficient statistic model at complexity level $\alpha$ over that of the minimal sufficient statistic at complexity level $\alpha_0$ is completely used to model an increasing part of the noise in the data. The worst overfitting occurs when we arrive at complexity $K(x)$, at which point we model all noise in the data apart from the meaningful information. Thus, our approach makes the fitting process of constrained maximum likelihood, first underfitting at low complexity levels of the models considered, then the complexity level of optimal fit (the minimal sufficient statistic), and subsequently the overfitting at higher levels of complexity of models, completely and formally explicit in terms of fixed data and individual models.

A. All Shapes are Possible 

Let $\beta_x(\alpha)$ be defined as in (II.7) and $h_x(\alpha)$ be defined as in (II.8). Both functions are 0 ($\beta_x(\alpha)$ may be $-O(1)$) for all $\alpha \ge K(x) + c_0$ where $c_0$ is a constant. We represent the coarse shape of these functions for different $x$ by functions characteristic of that shape. Informally, $g$ represents $f$ means that the graph of $f$ is contained in a strip of logarithmic (in the length $n$ of $x$) width centered on the graph of $g$, Figure 2.

Intuition: f follows g up to a prescribed precision. 

For formal statements we rely on the notion in Definition IV.1. Informally, we obtain the following results ($x$ is of length $n$ and complexity $K(x) = k$):

• Every non-increasing function $\beta$ represents $\beta_x$ for some $x$, and for every $x$ the function $\beta_x$ is represented by some $\beta$, provided $\beta(k) = 0$, $\beta(0) \le n - k$.

• Every function $h$, with non-increasing $h(\alpha) + \alpha$, represents $h_x$ for some $x$, and for every $x$ the function $h_x$ is represented by some $h$ as above, provided $h(k) = 0$, $h(0) \le n$ (and by the non-increasing property $h(0) \ge k$).

• $h_x(\alpha) + \alpha$ represents $\beta_x(\alpha) + k$, and conversely, for every $x$.

• For every $x$ and $\alpha$, every minimal size set $S \ni x$ of complexity at most $\alpha' = \alpha + O(\log n)$ has randomness deficiency $\beta_x(\alpha') \le \delta(x \mid S) \le \beta_x(\alpha) + O(\log n)$.

To provide precise statements we need a definition. 

Definition IV.1: Let $f, g$ be functions defined on $\{0, 1, \ldots, k\}$ with values in $\mathbb{N} \cup \{\infty\}$. We say that $f$ is $(\varepsilon(i), \delta(i))$-close to $g$ (in symbols: $f = \mathcal{E}(g)$) if

$f(i) \ge \min\{g(j) : j \in [\varepsilon(0), k],\ |j - i| \le \varepsilon(i)\} - \delta(i)$,
$f(i) \le \max\{g(j) : j \in [\varepsilon(0), k],\ |j - i| \le \varepsilon(i)\} + \delta(i)$

for every $i \in [\varepsilon(0), k]$. If $f = \mathcal{E}(g)$ and $g = \mathcal{E}(f)$ we write $f \cong g$.

Here $\varepsilon(i), \delta(i)$ are small values like $O(\log n)$ when we consider data $x$ of length $n$. Note that this definition is not symmetric and allows $f(i)$ to have arbitrary values for $i \in [0, \varepsilon(0))$. However, it is transitive in the following sense: if $f$ is $(\varepsilon_1(i), \delta_1(i))$-close to $g$ and $g$ is $(\varepsilon_2(i), \delta_2(i))$-close to $h$ then $f$ is $(\varepsilon_1(i) + \varepsilon_2(i), \delta_1(i) + \delta_2(i))$-close to $h$. If $f = \mathcal{E}(g)$ and $g$ is linear continuous, meaning that $|g(i) - g(j)| \le c|i - j|$ for some constant $c$, then the difference between $f(i)$ and $g(i)$ is bounded by $c\varepsilon(i) + \delta(i)$ for every $\varepsilon(0) \le i \le k$.

This notion of closeness, if applied unrestricted, is not always meaningful. For example, take as $g$ the function taking value $n$ for all even $i \in [0, k]$ and 0 for all odd $i \in [0, k]$. Then for every function $f$ on $[0, k]$ with $f(i) \in [0, n]$ we have $f = \mathcal{E}(g)$ for $\varepsilon = 1$, $\delta = 0$. But if $f = \mathcal{E}(g)$ and $g$ is non-increasing then $g$ indeed gives much information about $f$.
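The closeness relation of Definition IV.1 is a finite check, so it can be written out directly. The sketch below is one straightforward reading of the definition, under the assumption that $f$ and $g$ are given as sequences indexed by $0, \ldots, k$ and that $\varepsilon$ and $\delta$ are integer-valued callables; the function name `is_close` is not from the paper.

```python
def is_close(f, g, k, eps, delta):
    """Check whether f is (eps(i), delta(i))-close to g on {0, ..., k}.

    f and g are sequences of length k + 1; eps and delta are callables
    returning non-negative integers. Indices below eps(0) are unconstrained.
    """
    lo = eps(0)
    for i in range(lo, k + 1):
        # Values of g in the window of radius eps(i) around i.
        window = [g[j] for j in range(lo, k + 1) if abs(j - i) <= eps(i)]
        if not (min(window) - delta(i) <= f[i] <= max(window) + delta(i)):
            return False
    return True
```

With the alternating function from the example above (value $n$ at even arguments, 0 at odd ones), every $f$ with values in $[0, n]$ passes the check for $\varepsilon = 1$, $\delta = 0$, which is why the relation carries little information there. For a non-increasing $g$ the same parameters reject an $f$ that sits well above $g$.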

It is instructive to consider the following example. Let $g(i)$ be equal to $2k - i$ for $i = 0, 1, \ldots, \frac{k}{2} - 1$ and to $k - i$ for $i = \frac{k}{2}, \ldots, k$. Let $\varepsilon(i), \delta(i)$ be constant. Then a function $f = \mathcal{E}(g)$ may take every value for $i \in [0, \varepsilon)$, every value in $[2k - i - 2\delta, 2k - i + 2\delta]$ for $i \in [\varepsilon, \frac{k}{2} - \delta]$, every value in $[k - i - \delta, 2k - i + \delta]$ for $i \in (\frac{k}{2} - \delta, \frac{k}{2} + \delta]$, and every value in $[k - i - 2\delta, k - i + 2\delta]$ for $i \in (\frac{k}{2} + \delta, k]$ (see Figure 2). Thus the point $\frac{k}{2}$ of discontinuity of $g$ gives an interval of size $2\delta$ of large ambiguity of $f$. Loosely speaking the graph of $f$ can be any function contained in the strip of radius $2\delta$ whose middle line is the graph of $g$. For technical reasons it is convenient to use, in place of $h_x$, the MDL function (II.10). The definition of $\lambda_x$ immediately implies the following properties: $\lambda_x(\alpha)$ is non-increasing, $\lambda_x(\alpha) \ge K(x) - O(1)$ for all $\alpha$.

[Figure: Structure function $h_x(\alpha)$ in strip determined by $h(\alpha)$, that is, $h_x(\alpha) = \mathcal{E}(h(\alpha))$.]

The next lemma shows that properties of $\lambda_x$ translate directly into properties of $h_x$ since $h_x(\alpha)$ is always "close" to $\lambda_x(\alpha) - \alpha$.

Lemma IV.2: For every $x$ we have $\lambda_x(\alpha) \le h_x(\alpha) + \alpha \le \lambda_x(\alpha) + K(\alpha) + O(1)$ for all $\alpha$. Hence $\lambda_x(\alpha) \cong h_x(\alpha) + \alpha$ for $\varepsilon = 0$, $\delta = K(\alpha) + O(1)$.

Intuition: The functions $h_x(\alpha) + \alpha$ (the ML code length plus the model complexity) and $\lambda_x(\alpha)$ (the MDL code length) are essentially the same function.

Remark IV.3: The lemma implies that the same set witnessing $h_x(\alpha)$ also witnesses $\lambda_x(\alpha)$ up to an additive term of $K(\alpha)$. The converse is only true for the smallest cardinality set witnessing $\lambda_x(\alpha)$. Without this restriction a counterexample is: for random $x \in \{0,1\}^n$ the set $S = \{0,1\}^n$ witnesses $\lambda_x(\frac{n}{2}) = n + O(K(n))$ but does not witness $h_x(\frac{n}{2}) = \frac{n}{2} + O(K(n))$. (If $\lambda_x(\alpha) = K(x)$, then every set of complexity $\alpha' \le \alpha$ witnessing $\lambda_x(\alpha') = K(x)$ also witnesses $\lambda_x(\alpha) = K(x)$.)

The next two theorems state the main results of this work in a precise form. By $K(i, n, \lambda)$ we mean the minimum length of a program that outputs $n$, $i$, and computes $\lambda(j)$ given any $j$ in the domain of $\lambda$. We first analyze the possible shapes of the structure functions.

Theorem IV.4: (i) For every $n$ and every string $x$ of length $n$ and complexity $k$ there is an integer valued non-increasing function $\lambda$ defined on $[0, k]$ such that $\lambda(0) \le n$, $\lambda(k) = k$ and $\lambda_x = \mathcal{E}(\lambda)$ for $\varepsilon = \delta = K(n) + O(1)$.

(ii) Conversely, for every $n$ and non-increasing integer valued function $\lambda$ whose domain includes $[0, k]$ and such that $\lambda(0) \le n$ and $\lambda(k) = k$, there is $x$ of length $n$ and complexity $k \pm (K(k, n, \lambda) + O(1))$ such that $\lambda_x = \mathcal{E}(\lambda)$ for $\varepsilon = \delta = K(i, n, \lambda) + O(1)$.

Intuition: The MDL code length $\lambda_x$, and therefore by Lemma IV.2 also the original structure function $h_x$, can assume essentially every possible shape as a function of the contemplated maximal model complexity.

Remark IV.5: The theorem implies that for every function $h(i)$ defined on $[0, k]$ such that the function $\lambda(i) = h(i) + i$ satisfies the conditions of item (ii) there is an $x$ such that $h_x(i) = \mathcal{E}(h(i))$ with $\varepsilon = \delta = O(K(i, n, h))$. ◇

Remark IV.6: The proof of the theorem shows that for every function $\lambda(i)$ satisfying the conditions of item (ii) there is $x$ such that $\lambda_x(i \mid n, \lambda) = \mathcal{E}(\lambda(i))$ with $\varepsilon = \delta = K(i) + O(1)$ where the conditional structure function $\lambda_x(i \mid y) = \min_S\{K(S \mid y) + \log|S| : S \ni x,\ K(S \mid y) \le i\}$. Consequently, for every function $h(i)$ such that the function $\lambda(i) = h(i) + i$ satisfies the conditions of item (ii) there is an $x$ such that $h_x(i \mid n, h) = \mathcal{E}(h(i))$ with $\varepsilon = \delta = O(K(i))$ where the conditional structure function $h_x(i \mid y) = \min_S\{\log|S| : S \ni x,\ K(S \mid y) \le i\}$. ◇

Remark IV.7: In the proof of Item (ii) of the theorem we can consider every finite set $U$ with $|U| \ge 2^n$ in place of the set $A$ of all strings of length $n$. Then we obtain a string $x \in U$ such that $\lambda_x = \mathcal{E}(\lambda)$ with $\varepsilon(i) = \delta(i) = K(i, U, \lambda)$.



B. Selection of Best Fitting Model 

Recall that in classical statistics a major issue is whether 
a given model selection method works well if the "right" 
model is in the contemplated model class, and what model 
the method selects if the "right" model is outside the model 
class. We have argued earlier that the best we can do 
is to look for the "best fitting" model. But both "best 
fitting" and "best fitting in a constrained model class" are 
impossible to express classically for individual models and 
data. Instead, one focusses on probabilistic definitions and 
analysis. It is precisely these issues that can be handled in 
the Kolmogorov complexity setting. 

For the complexity levels $\alpha$ at which $h_x(\alpha)$ coincides with the diagonal sufficiency line $L(\alpha) = K(x) - \alpha$, the model class contains a "sufficient" (the "best fitting") model.

For the complexity levels $\alpha$ at which $h_x(\alpha)$ is above the sufficiency line, the model class does not contain a "sufficient" model. However, our results say that $h_x(\alpha) - L(\alpha)$ equals the minimal randomness deficiency that can be achieved by a model of complexity $\le \alpha$, and hence quantifies rigorously the properties of the data $x$ such a model can represent, that is, the level of "fitness" of the best model in the class.

Semi-computing $h_x(\alpha)$ from above, together with the model witnessing this value, automatically yields the objectively most fitting model in the class, that is, the model that is closest to the "true" model according to an objective measure of representing most properties of data $x$.

The following central result of this paper shows that $\lambda_x$ (equivalently $h_x$, by Lemma IV.2) and $\beta_x$ can be expressed in one another but for a logarithmic additive error.

Theorem IV.8: For every $x$ of length $n$ and complexity $k$ it holds $\beta_x(\alpha) + k \cong \lambda_x(\alpha)$ for $\varepsilon = \delta = O(\log n)$.

Intuition: A model achieving the MDL code length $\lambda_x(\alpha)$, or the ML code length $h_x(\alpha)$, essentially achieves the best possible fit $\beta_x(\alpha)$.

Corollary IV.9: For every $x$ of length $n$ and complexity $k \le n$ there is a non-increasing function $\beta$ such that $\beta(0) \le n - k$, $\beta(k) = 0$ and $\beta_x = \mathcal{E}(\beta)$ for $\varepsilon, \delta = O(\log n)$. Conversely, for every non-increasing function $\beta$ such that $\beta(0) \le n - k$, $\beta(k) = 0$ there is $x$ of length $n$ and complexity $k \pm \delta$ such that $\beta_x = \mathcal{E}(\beta)$ for $\varepsilon = \delta = O(\log n) + K(\beta)$.

Proof: The first part is more or less immediate. Or use the first part of Theorem IV.4 and then let $\beta(i) = \lambda(i) - k$. To prove the second part, use the second part of Theorem IV.8 and the second part of Theorem IV.4 with $\lambda(i) = \beta(i) + k$. ■

Remark IV.10: From the proof of Theorem IV.8 we see that for every finite set $S \ni x$, of complexity at most $\alpha + O(\log n)$ and minimizing $\Lambda(S)$, we have $\delta(x \mid S) \le \beta_x(\alpha) + O(\log n)$. Ignoring $O(\log n)$ terms, at every complexity level, every best hypothesis at this level with respect to $\Lambda(S)$ is also a best one with respect to typicality. This explains why it is worthwhile to find shortest two-part descriptions for given data $x$: this is the single known way to find an $S \ni x$ with respect to which $x$ is as typical as possible at that complexity level. Note that the set $\{(x, S, \beta) \mid x \in S,\ \delta(x \mid S) \le \beta\}$ is not enumerable so we are not able to generate such $S$'s directly (Section VII).

The converse is not true: not every hypothesis, consisting of a finite set, witnessing $\beta_x(\alpha)$ also witnesses $\lambda_x(\alpha)$ or $h_x(\alpha)$. For example, let $x$ be a string of length $n$ with $K(x) \ge n$. Let $S_1 = \{0,1\}^n \cup \{y\}$ where $y$ is a string of length $\frac{n}{2}$ such that $K(x, y) \ge \frac{3n}{2}$ and let $S_2 = \{0,1\}^n$. Then both $S_1, S_2$ witness $\beta_x(\frac{n}{2} + O(\log n)) = O(1)$ but $\Lambda(S_1) = \frac{3n}{2} + O(\log n) \gg \lambda_x(\frac{n}{2} + O(\log n)) = n + O(\log n)$ while $\log|S_2| = n \gg h_x(\frac{n}{2} + O(\log n)) = \frac{n}{2} + O(\log n)$. ◇

However, for every $\alpha$ such that $\lambda_x(i)$ decreases when $i \to \alpha$ with $i < \alpha$, a witness set for $\beta_x(\alpha)$ is also a witness set for $\lambda_x(\alpha)$ and $h_x(\alpha)$. We will call such $\alpha$ critical (with respect to $x$): these are the model complexities at which the two-part MDL code-length decreases, while it is stable in between such critical points. The next theorem shows, for critical $\alpha$, that for every $A \ni x$ with $K(A) \approx \alpha$ and $\delta(x \mid A) \approx \beta_x(\alpha)$, we have $\log|A| \approx h_x(\alpha)$ and $\Lambda(A) \approx \lambda_x(\alpha)$. More specifically, if $K(A) \approx \alpha$ and $\delta(x \mid A) \approx \beta_x(\alpha)$ but $\Lambda(A) \gg \lambda_x(\alpha)$ or $\log|A| \gg h_x(\alpha)$ then there is $S \ni x$ with $K(S) \ll \alpha$ and $\Lambda(S) \approx \lambda_x(\alpha)$.

Theorem IV.11: For all $A \ni x$ there is $S \ni x$ such that $K(S) \le \lambda_x(\alpha) + (\delta(x \mid A) - \beta_x(\alpha))$, $K(S) \le K(A) + (\lambda_x(\alpha) - \Lambda(A)) + (\delta(x \mid A) - \beta_x(\alpha))$, and $K(S) \le \alpha + (h_x(\alpha) - \log|A|) + (\delta(x \mid A) - \beta_x(\alpha))$ where all inequalities hold up to $O(\log \Lambda(A))$ additive term.

Intuition: Although models of best fit (witnessing $\beta_x(\alpha)$) do not necessarily achieve the MDL code length $\lambda_x(\alpha)$ or the ML code length $h_x(\alpha)$, they do so at the model complexities where the MDL code length decreases, and, equivalently, the ML code length decreases at a slope of more than $-1$.

C. Invariance under Recoding of Data 

In what sense is the structure function invariant under recoding of the data? Osamu Watanabe suggested the example of replacing the data $x$ by a shortest program $x^*$ for it. Since $x^*$ is incompressible it is a typical element of the set of all strings of length $|x^*| = K(x)$, and hence $h_{x^*}(\alpha)$ drops to the sufficiency line $L(\alpha) = K(x) - \alpha$ already for some $\alpha \le K(K(x))$, so almost immediately (and it stays within logarithmic distance of that line henceforth). That is, $h_{x^*}(\alpha) = K(x) - \alpha$ up to logarithmic additive terms in argument and value, irrespective of the (possibly quite different) shape of $h_x$. Since the Kolmogorov complexity function $K(x) = |x^*|$ is not recursive, [15], the recoding function $f(x) = x^*$ is also not recursive. Moreover, while $f$ is one-one and total it is not onto. But it is the partiality of the inverse function (not all strings are shortest programs) that causes the collapse of the structure function. If one restricts the finite sets containing $x^*$ to be subsets of $\{y^* : y \in \{0,1\}^*\}$, then the resulting structure function $h_{x^*}$ is within a logarithmic strip around $h_x$. However, the structure function is invariant under "proper" recoding of the data.

Lemma IV.12: Let $f$ be a recursive permutation of the set of finite binary strings (one-one, total, and onto). Then, $h_{f(x)} = \mathcal{E}(h_x)$ for $\varepsilon, \delta = K(f) + O(1)$.

Proof: Let $S \ni x$ be a witness of $h_x(\alpha)$. Then, $S_f = \{f(y) : y \in S\}$ satisfies $K(S_f) \le \alpha + K(f) + O(1)$ and $|S_f| = |S|$. Hence, $h_{f(x)}(\alpha + K(f) + O(1)) \le h_x(\alpha)$. Let $R \ni f(x)$ be a witness of $h_{f(x)}(\alpha)$. Then, $R_{f^{-1}} = \{f^{-1}(y) : y \in R\}$ satisfies $K(R_{f^{-1}}) \le \alpha + K(f) + O(1)$ and $|R_{f^{-1}}| = |R|$. Hence, $h_x(\alpha + K(f) + O(1)) \le h_{f(x)}(\alpha)$ (since $K(f^{-1}) = K(f) + O(1)$). ■
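The combinatorial core of this proof, that recoding a model by a permutation preserves its cardinality and still contains the recoded data, can be illustrated with a minimal sketch. The bitwise complement is used here as one simple example of a computable permutation of binary strings; it and the helper names are assumptions made for illustration, and the complexity bounds of the lemma are of course not checked by the code.

```python
def recode(S, f):
    """Image S_f = {f(y) : y in S} of a finite model S under a recoding f."""
    return {f(y) for y in S}


def complement(y):
    """Bitwise complement: a simple computable permutation of binary strings."""
    return "".join("1" if b == "0" else "0" for b in y)


S = {"0011", "0101", "1000"}
x = "0011"
S_f = recode(S, complement)
```

Since `complement` is its own inverse, applying `recode` twice returns the original model, which corresponds to the second half of the proof where $f^{-1}$ is applied to a witness for $f(x)$.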

D. Reach of Results 

In Kolmogorov's initial proposal, as in this work, models 
are finite sets of finite binary strings, and the data is one of 
the strings (all discrete data can be binary encoded). The 
restriction to finite set models is just a matter of conve- 
nience: the main results generalize to the case where the 
models are arbitrary computable probability density func- 
tions, 122], QDji and to the model class consisting 
of arbitrary total recursive functions, |25|. We summarize 
the proofs of this below. Since our results hold only within 
additive precision that is logarithmic in the binary length 
of the data, and the equivalences between the model classes 
hold up to the same precision, the results hold equally for 
the more general model classes. 

The generality of the results is at the same time a restriction. In classical statistics one is commonly interested in model classes that are partially poorer and partially richer than the ones we consider. For example, the class of Bernoulli processes, or $k$-state Markov chains, is poorer
than the class of computable probability density functions 
of moderate maximal Kolmogorov complexity a, in that the 
latter may contain functions that require far more complex 
computations than the rigid syntax of the former classes 
allows. Indeed, the class of computable probability density 
functions of even moderate complexity allows implementa- 
tion of a function mimicking a universal Turing machine 
computation. On the other hand, even the lowly Bernoulli 
process can be equipped with a noncomputable real bias in 
(0, 1), and hence the generated probability density function 
over $n$ trials is not a computable function. This incomparability of the algorithmic model classes studied here and the traditionally studied statistical model classes means that the current results cannot be directly transplanted to the traditional setting. They should be regarded as pristine truths that hold in a platonic world that can be used as a guideline to develop analogues in model classes that are of more traditional concern, as in [20]. The questions to
be addressed are: Can these platonic truths say anything 
usable? If we restrict ourselves to statistical model classes, 
how far from optimal are we? Note that in themselves 
the finite set models are not really that far from classical 
statistical models. 

V. Prediction and Model Selection 

A. Best Prediction Strategy 

In [29] the notion of a snooping curve $L_x(\alpha)$ of $x$ was introduced, expressing the minimal logarithmic loss in predicting the consecutive elements of a given individual string $x$, in each prediction using the preceding sequence of elements, by the best prediction strategy of complexity at most $\alpha$.

Intuition: The snooping curve quantifies the quality of the best predictor for a given sequence at every possible predictor complexity.

Formally, $L_x(\alpha) = \min_{K(P) \le \alpha} \mathrm{Loss}_P(x)$. The minimum is taken over all prediction strategies $P$ of complexity at most $\alpha$. A prediction strategy $P$ is a mapping from the set of strings of length less than $|x|$ into the set of rational numbers in the segment $[0,1]$. The value $P(x_1 \ldots x_i)$ is regarded as our belief (or probability) that $x_{i+1} = 1$ after we have observed $x_1, \ldots, x_i$. If the actual bit $x_{i+1}$ is 1 the strategy suffers the loss $-\log p$, otherwise $-\log(1 - p)$, where $p = P(x_1 \ldots x_i)$. The strategy is a finite object and $K(P)$ may be defined as the complexity of this object, or as the minimum size of a program that identifies $n = |x|$ and given $y$ finds $P(y)$. The notation $\mathrm{Loss}_P(x)$ indicates the total loss of $P$ on $x$, i.e. the sum of all $n$ losses: $\mathrm{Loss}_P(x) = \sum_{i=0}^{n-1} \left(-\log|P(x_1 \ldots x_i) - 1 + x_{i+1}|\right)$. Thus, the snooping curve $L_x(\alpha)$ gives the minimal loss suffered on all of $x$ by a prediction strategy, as a function of the complexity at most $\alpha$ of the contemplated class of prediction strategies. The question arises what shapes these functions can have, for example, whether there can be sharp drops in the loss for only minute increases in complexity of prediction strategies.

A result of V'yugin describes possible shapes of $L_x$ but only for $\alpha = o(n)$ where $n$ is the length of $x$. Here we show that for every function $L$ and every $k \le n$ there is a data sequence $x$ such that $L_x(\alpha \pm O(\log n)) = L(\alpha) \pm O(\log n)$, provided $L(0) \le n$, $L(\alpha) + \alpha$ is non-increasing on $[0, k]$, and $L(\alpha) = 0$ for $\alpha \ge k$.

Lemma V.1: $L_x(\alpha \pm O(\log n)) = h_x(\alpha \pm O(\log n))$ for every $x$ and $\alpha$. Thus, Lemma IV.2 and Theorem IV.4 describe also the coarse shape of all possible snooping curves.

Proof: ($\le$) A given finite set $A$ of binary strings of length $n$ can be identified with the following prediction strategy $P$: Having read the prefix $y$ of $x$ it outputs $p = |A_{y1}|/|A_y|$ where $A_y$ stands for the set of strings in $A$ having prefix $y$.

It is easily seen, by induction, that $\mathrm{Loss}_P(y) = \log(|A|/|A_y|)$ for every $y$. Therefore, $\mathrm{Loss}_P(x) = \log|A|$ for every $x \in A$. Since $P$ corresponds to $A$ in the sense that $K(P \mid A) = O(1)$, we obtain $L_x(\alpha + O(\log n)) \le h_x(\alpha)$. The term $O(\log n)$ is required, because the initial set of complexity $\alpha$ might contain strings of different lengths while we need to know $n$ to get rid of the strings of lengths different from $n$.

($\ge$) Conversely, assume that $\mathrm{Loss}_P(x) \le m$. Let $A = \{x \in \{0,1\}^n : \mathrm{Loss}_P(x) \le m\}$. Since $\sum_{x \in \{0,1\}^n} 2^{-\mathrm{Loss}_P(x)} = 1$ (proof by induction on $n$), and $2^{-\mathrm{Loss}_P(x)} \ge 2^{-m}$ for every $x \in A$, we can conclude that $A$ has at most $2^m$ elements. Since $K(A \mid P) = O(\log m)$, we obtain $h_x(\alpha + O(\log n)) \le L_x(\alpha)$. ■

Thus, within the obvious constraint of the function $L_x(\alpha) + \alpha$ being non-increasing, all shapes for the minimal total loss $L_x(\alpha)$ as a function of the allowed predictor complexity are possible.
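The first half of the proof is constructive, so it can be checked on a small example. The sketch below builds the prediction strategy $p = |A_{y1}|/|A_y|$ from a finite set of equal-length strings and computes its total logarithmic loss. The set `A` is an arbitrary illustrative choice, and the value returned for prefixes that occur in no element of `A` is an assumption of this sketch (the proof does not need to specify it).

```python
import math
from itertools import product


def strategy_from_set(A):
    """Return the strategy P(y) = |A_{y1}| / |A_y| for equal-length bit strings A."""
    def P(y):
        with_prefix = [s for s in A if s.startswith(y)]
        if not with_prefix:
            return 0.5  # arbitrary value off the set; an assumption of this sketch
        ones = sum(1 for s in with_prefix if s.startswith(y + "1"))
        return ones / len(with_prefix)
    return P


def log_loss(P, x):
    """Total logarithmic loss (in bits) of strategy P on the bit string x."""
    total = 0.0
    for i, bit in enumerate(x):
        p = P(x[:i])
        q = p if bit == "1" else 1.0 - p  # probability assigned to the actual bit
        total += math.inf if q == 0 else -math.log2(q)
    return total


A = {"0011", "0101", "1000", "1111"}
P = strategy_from_set(A)
```

On every element of `A` the loss equals $\log|A| = 2$ bits, as the induction in the proof predicts, and strings outside `A` receive infinite loss. Summing $2^{-\mathrm{Loss}_P(x)}$ over all 4-bit strings gives 1, which is the identity used in the converse direction.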

B. Foundations of MDL 

(i) Consider the following algorithm based on the Minimum Description Length principle. Given $x$, the data to explain, and $\alpha$, the maximum allowed complexity of explanation, we search for programs $p$ of length at most $\alpha$ that print a finite set $S \ni x$. Such pairs $(p, S)$ are possible explanations. The best explanation is defined to be the $(p, S)$ for which $\delta(x \mid S)$ is minimal. Since the function $\delta(x \mid S)$ is not computable, we cannot find the best explanation. The programs use unknown computation time and thus we can never be certain that we have found all possible explanations.

To overcome this problem we use the indirect method of MDL: We run all programs in dovetailed fashion. At every computation step $t$ consider all pairs $(p, S)$ such that program $p$ has printed the set $S$ containing $x$ by time $t$. Let $(p_t, L_t)$ stand for the pair $(p, S)$ such that $|p| + \log|S|$ is minimal among all these pairs $(p, S)$. The best hypothesis $L_t$ changes from time to time due to the appearance of a better hypothesis. Since no hypothesis is declared best twice, from some moment onwards the explanation $(p_t, L_t)$ which is declared best does not change anymore.
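The bookkeeping of the indirect method can be sketched as follows. Actually dovetailing all programs of a universal machine is outside the scope of a short example, so the sketch assumes a hypothetical, pre-recorded list of `(program, printed_set)` pairs in the order in which the programs halt; the function name `indirect_mdl` and the sample events are illustrative assumptions.

```python
import math


def indirect_mdl(events, x):
    """Replay a dovetailed run and track the best two-part code for x so far.

    `events` lists (program, printed_set) pairs in halting order. A new best
    explanation is yielded only when its code length |p| + log|S| is strictly
    shorter than every earlier one, so no hypothesis is declared best twice.
    """
    best = math.inf
    for program, printed in events:
        if x not in printed:
            continue  # this program does not explain x
        length = len(program) + math.log2(len(printed))
        if length < best:
            best = length
            yield program, printed, length


# Hypothetical halting order; programs are represented by their bit strings.
EVENTS = [
    ("00", {format(i, "04b") for i in range(16)}),
    ("10110", {"0011", "0101"}),
    ("110", {"0000", "0011", "1100", "1111"}),
    ("1110001", {"0011"}),
]
```

Replaying `EVENTS` for `x = "0011"` declares the first pair best at 6 bits and later replaces it by the third pair at 5 bits; the other two pairs never become best because their two-part codes are not strictly shorter. The sequence of best code lengths is strictly decreasing, which is the property the text uses to argue that the method stabilizes.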

Compare this indirect method with the direct one: after step $t$ of dovetailing select $(p, S)$ for which $\log|S| - K^t(x \mid S)$ is minimum among all programs $p$ that up to this time have printed a set $S$ containing $x$, where $K^t(x \mid S)$ is the approximation of $K(x \mid S)$ obtained after $t$ steps of dovetailing, that is, $K^t(x \mid S) = \min\{|q| : U \text{ on input } \langle q, S \rangle \text{ prints } x \text{ in at most } t \text{ steps}\}$. Let $(q_t, B_t)$ stand for that model. This time the same hypothesis can be declared best twice. However from some moment onwards the explanation $(q_t, B_t)$ which is declared best does not change anymore.

Why do we prefer the indirect method to the direct one? The explanation is that we have a comparable situation in the practice of real-world MDL, in the analogous process of finding the MDL code. There, we often deal with $t$ that are much less than the time of stabilization of both $L_t$ and $B_t$. For small $t$, the model $L_t$ is better than $B_t$ in the following respect: $L_t$ has some guarantee of goodness, as we know that $\delta(x \mid L_t) + K(x) \le |p_t| + \log |L_t| + O(1)$. That is, we know that the sum of the deficiency of $x$ in $L_t$ and $K(x)$ is less than some known value. In contrast, the model $B_t$ has no guarantee of goodness at all: we do not know any upper bound either for $\delta(x \mid B_t)$ or for $\delta(x \mid B_t) + K(x)$.

Theorem IV.8 implies that the indirect method of MDL gives not only some guarantee of goodness but also that, in the limit, that guarantee approaches the value it upper bounds, that is, approaches $\delta(x \mid L_t) + K(x)$, and $\delta(x \mid L_t)$ itself is not much greater than $\delta(x \mid B_t)$ (assuming that $\alpha$ is not critical). That is, in the limit, the method of MDL will yield an explanation that is only a little worse than the best explanation.

(ii) If $S \ni x$ is a smallest set such that $K(S) \le \alpha$, then $S$ can be converted into a best strategy of complexity at most $\alpha$ to predict the successive bits of $x$ given the preceding ones (Section V-A). Interpreting "to explain" as "to be able to predict well", MDL in the sense of sets witnessing $\lambda_x(\alpha)$ indeed gives a good explanation at every complexity level $\alpha$.

(iii) In statistical applications of MDL HH], 0, MML [30], and related methods, one selects the model in a given model class that minimizes the sum of the model code length and the data-to-model code length; in modern versions 12] one selects the model that minimizes just the data-to-model code length (ignoring the model code length). For example, one uses data-to-model code $-\log P(x)$ for data $x$ with respect to probability (density function) model $P$. For example, if the model is the uniform distribution over $n$-bit strings, then the data-to-model code for $x = 00\ldots0$ is $-\log 1/2^n = n$, even though we can compress $x$ to about $\log n$ bits, without even using the model. Thus, the data-to-model code is the worst-case number of bits required for data of given length using the model, rather than the optimal number of bits for the particular data at hand. This is precisely what we do in the structure function approach: the data-to-model cost of $x$ with respect to model $A \ni x$ is $\log |A|$, the worst-case number of bits required to specify an element of $A$ rather than the minimal number of bits required to specify $x$ in particular. In contrast, ultimate compression of the two-part code, which is suggested by the "minimum description length" phrase [24], means minimizing $K(A) + K(x \mid A)$ over all models $A$ in the model class. In Theorem IV.8 we have essentially shown that the "worst-case" data-to-model code above is the approach that guarantees the best-fitting model. In contrast, the "ultimate compression" approach can yield models that are far from best fit. (It is easy to see that this happens only if the data are "not typical" for the contemplated model [24].)
For instance, let $x$ be a string of length $n$ and complexity about $n/2$ for which $\beta_x(O(\log n)) = n/4 + O(\log n)$. This means that the best model at a very low complexity level (essentially level 0 within the "logarithmic additive precision" which governs our techniques and results) has significant randomness deficiency and hence is far from "optimal" or "sufficient". Such strings exist by Corollary IV.9. Such strings are not the strings of maximal Kolmogorov complexity, with $K(x) \ge n$, such as most likely result from $n$ flips with a fair coin, but strings that must have a more complex cause since their minimal sufficient statistic has complexity higher than $O(\log n)$. Consider the model class consisting of the finite sets containing $x$ at complexity level $\alpha = O(\log n)$. Then for the model $A_0 = \{0,1\}^n$ we have $K(A_0) = O(\log n)$ and $K(x \mid A_0) = n/2 + O(\log n)$; thus the sum $K(A_0) + K(x \mid A_0) = n/2 + O(\log n)$ is minimal up to a term $O(\log n)$. However, the randomness deficiency of $x$ in $A_0$ is about $n/2$, which is much bigger than the minimum $\beta_x(O(\log n)) \approx n/4$. For the model $A_1$ witnessing $\beta_x(O(\log n)) \approx n/4$ we also have $K(A_1) = O(\log n)$ and $K(x \mid A_1) = n/2 + O(\log n)$. However, it has smaller cardinality: $\log |A_1| = 3n/4 + O(\log n)$, which causes the smaller randomness deficiency.
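The comparison of the two models can be written out explicitly. The following worked computation is a sketch that assumes the randomness deficiency is $\delta(x \mid A) = \log |A| - K(x \mid A)$, as used elsewhere in this paper, and that all equalities hold up to $O(\log n)$ terms:

```latex
\begin{align*}
K(A_0) + K(x \mid A_0) &= \tfrac{n}{2} + O(\log n), &
K(A_1) + K(x \mid A_1) &= \tfrac{n}{2} + O(\log n), \\
\delta(x \mid A_0) &= n - \tfrac{n}{2} + O(\log n) = \tfrac{n}{2} + O(\log n), &
\delta(x \mid A_1) &= \tfrac{3n}{4} - \tfrac{n}{2} + O(\log n) = \tfrac{n}{4} + O(\log n).
\end{align*}
```

Both models achieve the same two-part code length, so ultimate compression cannot separate them, while the log-cardinalities $n$ and $3n/4$ do.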

The same happens also for other model classes, such as probability models; see the Appendix. Consider, for instance, the class of Bernoulli processes with rational bias $p$ for outcome "1" ($0 \le p \le 1$) to generate binary strings of length $n$. Suppose we look for the model minimizing the code length of the model plus data given the model: $K(p \mid n) + K(x \mid p, n)$. Let the data be $x = 00\ldots0$. Then the probability model $P$ (the uniform distribution) with $P(x) = 1/2^n$ corresponding to probability $p = \frac{1}{2}$ compresses the data code to $K(x \mid n, p) = O(1)$ bits since we can describe $x$ by the program "print $n$ 0's", and hence need only $O(1)$ bits apart from $n$. We also trivially have $K(p \mid n) \le K(p) + O(1) = O(1)$. But we cannot distinguish between the probability model $P$ hypothesis based on $p$ and the probability model $P'$ with $P'(x) = 1$ (singular distribution) hypothesis based on $p' = 0$ in terms of these code lengths: we find the same code length $K(x \mid n, p') = O(1)$ bits and $K(p' \mid n) = O(1)$ if we replace $p = \frac{1}{2}$ by $p' = 0$ in these expressions. Thus we have no basis to prefer hypothesis $p$ or hypothesis $p'$, even though the second possibility is overwhelmingly more likely. This shows that ultimate compression of the two-part code, here for example resulting in $K(p \mid n) + K(x \mid n, p)$, may yield a (probability) model $P$ based on $p = \frac{1}{2}$ for which the data has the maximal possible randomness deficiency ($-\log P(x) - K(x \mid n, p) = n - O(1)$) and hence is atypical.

However, in the structure functions $h_x(\alpha)$ and $\lambda_x(\alpha)$ the data-to-model code for the model $p = \frac{1}{2}$ is $-\log P(x) = -\log (\frac{1}{2})^n = n$ bits, while $p' = 0$ results in $-\log P'(x) = -\log 1^n = 0$ bits. Choosing the shortest data-to-model code results in the minimal randomness deficiency, as in (the generalization to probability distributions of) Theorem IV.8.
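The two data-to-model code lengths for the all-zero string can be computed directly. A minimal sketch, where the function name `data_to_model_bits` and the choice $n = 32$ are illustrative only:

```python
from math import inf, log2

def data_to_model_bits(x, p):
    """-log2 P(x) for a Bernoulli(p) model of the bit string x; inf if P(x) = 0."""
    ones = x.count("1")
    zeros = len(x) - ones
    bits = 0.0
    for count, prob in ((ones, p), (zeros, 1.0 - p)):
        if count == 0:
            continue          # this outcome never occurs in x: probability factor 1
        if prob == 0.0:
            return inf        # x is impossible under this model
        bits += -count * log2(prob)
    return bits

n = 32
x = "0" * n
uniform_bits = data_to_model_bits(x, 0.5)    # model p = 1/2
singular_bits = data_to_model_bits(x, 0.0)   # model p' = 0
print(uniform_bits, singular_bits)
```

The model code lengths $K(p \mid n)$ and $K(p' \mid n)$ are both $O(1)$, so the difference between the two hypotheses shows up only in this data-to-model term.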

(iv) Another question arising in MDL or maximum likelihood (ML) estimation is its performance if the "true" model is not part of the contemplated model class. Given certain data, why would we assume they are generated by probabilistic or deterministic processes? They have arisen by natural processes most likely not conforming to mathematical idealization. Even if we can assume the data arose from a process that can be mathematically formulated, such situations arise if we restrict modeling of data arising from a "complex" source (conventional analogue being data arising from $2k$-parameter sources) by "simple" models (conventional analogue being $k$-parameter models). Again, Theorem IV.8 shows that, within the class of models of maximal complexity $\alpha$, under these constraints we still select a simple model for which the data is maximally typical. This is particularly significant for data $x$ if the allowed complexity $\alpha$ is significantly below the complexity of the Kolmogorov minimal sufficient statistic, that is, if $h_x(\alpha) + \alpha \gg K(x) + c$. This situation is potentially common, for example if we have a small data sample generated by a complex process. Then, the data will typically be non-stochastic in the sense of Section V-E. For a data sample that is very large relative to the complexity of the process generating it, this will typically not be the case and the structure function will drop to the sufficiency line early on.

C. Foundations of Maximum Likelihood 

The algorithm based on the ML principle is similar to the algorithm of the previous example. The only difference is that the currently best $(p, S)$ is the one for which $\log |S|$ is minimal. In this case the limit hypothesis $S$ will witness $h_x(\alpha)$ and we obtain the same corollary: $\delta(x \mid S) \le \beta_x(\alpha - O(\log n)) + O(\log n)$.

D. Approximation Improves Models 

Assume that in the MDL algorithm, as described in Section V-B, we change the currently best explanation $(p_1, S_1)$ to the explanation $(p_2, S_2)$ only if $|p_2| + \log |S_2|$ is much less than $|p_1| + \log |S_1|$, say $|p_2| + \log |S_2| \le |p_1| + \log |S_1| - c \log n$ for a constant $c$. It turns out that if $c$ is large enough and $p_1$ is a shortest program of $S_1$, then $\delta(x \mid S_2)$ is much less than $\delta(x \mid S_1)$. That is, every time we change the explanation we improve its goodness unless the change is just caused by the fact that we have not yet found the minimum length program for the current model.

Lemma V.2: There is a constant $c$ such that if $\Lambda(S_2) \le \Lambda(S_1) - 2c \log |x|$, then $\delta(x \mid S_2) \le \delta(x \mid S_1) - c \log |x| + O(1)$.

Proof: Assume the notation of Theorem IV.8. By (C.4), for every pair of sets $S_1, S_2 \ni x$ we have $\delta(x \mid S_2) - \delta(x \mid S_1) = \Lambda(S_2) - \Lambda(S_1) + \Delta$ with $\Delta = K(S_1 \mid x^*) - K(S_2 \mid x^*) + O(1) \le K(S_1 \mid S_2, x^*) + O(1) \le K(S_1 \mid S_2, x) + O(1)$. As $\Lambda(S_2) - \Lambda(S_1) \le |p_2| + \log |S_2| - \Lambda(S_1) = |p_2| + \log |S_2| - (|p_1| + \log |S_1|) \le -2c \log |x|$, we need to prove that $K(S_1 \mid S_2, x) \le c \log |x| + O(1)$. Note that $(p_1, S_1)$, $(p_2, S_2)$ are consecutive explanations in the algorithm and every explanation may appear only once. Hence to identify $S_1$ we only need to know $p_2$, $S_2$, $\alpha$ and $x$. Since $p_2$ may be found from $S_2$ and length $|p_2|$ as the first program computing $S_2$ of length $|p_2|$, obtained by running all programs dovetailed style, we have $K(S_1 \mid S_2, x) \le 2 \log |p_2| + 2 \log \alpha + O(1) \le 4 \log |x| + O(1)$. Hence we can choose $c = 4$. (Continued in Section VI-D) ■

E. Non-stochastic Objects

Let $\alpha_0, \beta_0$ be natural numbers. A string $x$ is called $(\alpha_0, \beta_0)$-stochastic by Kolmogorov if $\beta_x(\alpha_0) \le \beta_0$. In [22] it is proven that for some $c, C$ for all $n$ and all $\alpha_0, \beta_0$ with $2\alpha_0 + \beta_0 < n - c \log n - C$ there is a string $x$ of length $n$ that is not $(\alpha_0, \beta_0)$-stochastic. Corollary IV.9 strengthens this result of Shen: for some $c, C$ for all $n$ and all $\alpha_0, \beta_0$ with $\alpha_0 + \beta_0 < n - c \log n - C$ there is a string $x$ of length $n$ that is not $(\alpha_0, \beta_0)$-stochastic. Indeed, apply Corollary IV.9 to $k = \alpha_0 + c_1 \log n + C_1$ (we will choose $c_1, C_1$ later) and the function $\beta(i) = n - k$ for $i < k$ and $\beta(i) = 0$ for $i = k$. For the $x$ existing by Corollary IV.9 we have $\beta_x(\alpha_0) \ge \beta(\alpha_0 + (c_2 \log n + C_2)) - (c_2 \log n + C_2) \ge \beta(k - 1) - (c_2 \log n + C_2) = n - k - (c_2 \log n + C_2) = n - (\alpha_0 + c_1 \log n + C_1) - (c_2 \log n + C_2) > \beta_0$. (The first inequality is true if $\alpha_0 + c_2 \log n + C_2 \le k - 1$; thus let $c_1 = c_2$, $C_1 = C_2 + 1$. For the last inequality to be true let $c = c_1 + c_2$ and $C = C_1 + C_2$.) That is, $x$ is not $(\alpha_0, \beta_0)$-stochastic.

VI. Fine Structure and Sufficient Statistic 

Above, we looked at the coarse shape of the structure function, but not at the fine detail. We show that $h_x$ coming from infinity drops to the sufficiency line $L$ defined by $L(\alpha) + \alpha = K(x)$. It first touches this line for some $\alpha_0 \le K(x) + O(1)$. It then touches this line a number of times (bounded by a universal constant) and in between moves slightly (logarithmically) away in little bumps. There is a simple explanation why these bumps are there: It follows from (II.3) and (II.5) that there is a constant $c_1$ such that for every $S \ni x$, we have $K(S) + \log |S| \ge K(x) + K(S \mid x^*) - c_1$. If, moreover, $K(S) + \log |S| \le K(x) + c_2$, then $K(S \mid x^*) \le c_2 + c_1$. This was already observed in ^Hj. Consequently, there are less than $2^{c_2 + c_1 + 1}$ distinct such sets $S$. Suppose the graph of $h_x$ drops within distance $c_2$ of the sufficiency line at $\alpha_0$; then it cannot be within distance $c_2$ on more than $2^{c_2 + c_1 + 1}$ points. By the pigeon-hole principle, there is $\alpha \in [\alpha_0, K(x)]$ such that $h_x(\alpha) + \alpha \ge \lambda_x(\alpha) \ge K(x) + \log (K(x) - \alpha_0) - c_1 - 1$. So if $|K(x) - \alpha_0|$ is of order $\Omega(n)$, then we obtain the logarithmic bumps, or possibly only one logarithmic bump, on the interval $[\alpha_0, K(x)]$. However, we will show below that $h_x$ cannot move away more than $O(\log |K(x) - \alpha_0|)$ from the sufficiency line on the interval $[\alpha_0, K(x)]$. The intuition here is that a data sequence can have a simple satisfactory probabilistic explanation, but we can also explain it by many only slightly more complex explanations that are slightly less satisfactory but also model more accidental random features: models that are only slightly more complex but that significantly overfit the data sequence by modeling noise.

A. Initial behavior 

Let $x$ be a string of complexity $K(x) = k$. The structure function $h_x(\alpha)$ defined by (II.8) rises sharply above the sufficiency line for very small values of $\alpha$, with $h_x(\alpha) = \infty$ for $\alpha$ close to 0. To analyze the behavior of $h_x$ near the origin, define a function

$m(x) = \min_y \{K(y) : y \ge x\}$, (VI.1)

the minimum complexity of a string greater than $x$; that is, $m(x)$ is the greatest monotonic non-decreasing function that lower bounds $K(x)$. The function $m(x)$ tends to infinity as $x$ tends to infinity, very slowly, slower than any computable function.

For every $\alpha \in [0, m(x) - O(1))$ we have $h_x(\alpha) = \infty$. To see this, we reason as follows: For a set $S \ni x$ with $K(S) = \alpha$ with $\alpha$ in the above range we can consider the largest element $y$ of $S$. Then $y$ has complexity at most $\alpha + O(1) < m(x)$, that is, $K(y) < m(x)$, which implies that $y < x$. But then $x \notin S$, which is a contradiction.
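The shape of definition (VI.1) can be seen on a finite toy table. The sketch below assumes a made-up list of values standing in for $K$ on an initial segment of the integers (the real $K$ and $m$ are not computable, and the true minimum ranges over all $y \ge x$, not a finite window). It computes the greatest non-decreasing lower bound as a running minimum taken from the right.

```python
def greatest_monotone_lower_bound(values):
    """m[i] = min(values[i:]), computed right to left in a single pass."""
    m = list(values)
    for i in range(len(m) - 2, -1, -1):
        m[i] = min(m[i], m[i + 1])
    return m

toy_K = [3, 7, 5, 9, 4, 8, 6, 10]   # made-up stand-in values, not real complexities
m = greatest_monotone_lower_bound(toy_K)
print(m)
```

Any non-decreasing function that stays below `toy_K` everywhere also stays below `m`, which is the sense in which `m` is the greatest such bound on this window.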

B. Sufficient Statistic 

A sufficient statistic of the data contains all information in the data about the model. In introducing the notion of sufficiency in classical statistics, Fisher [7] stated: "The statistic chosen should summarize the whole of the relevant information supplied by the sample. This may be called the Criterion of Sufficiency ... In the case of the normal curve of distribution it is evident that the second moment is a sufficient statistic for estimating the standard deviation." For the classical (probabilistic) theory see, for example, [H]. In ^U] an algorithmic theory of sufficient statistic (relating individual data to individual model) was developed and its relation with the probabilistic version established. The algorithmic basics are as follows: Intuitively, a model expresses the essence of the data if the two-part code describing the data consisting of the model and the data-to-model code is as concise as the best one-part description. Formally:

Definition VI.1: A finite set $S$ containing $x$ is optimal for $x$ if

$K(S) + \log |S| \le K(x) + c$. (VI.2)

Here $c$ is some small value, constant or logarithmic in $K(x)$, depending on the context. A minimal length description $S^*$ of such an optimal set is called a sufficient statistic for $x$. To specify the value of $c$ we will say $c$-optimal and $c$-sufficient.

If a set $S$ is $c$-optimal with $c$ constant, then by (II.9) we have $K(x) - c_2 \le \Lambda(S) \le K(x) + c$. Hence, with respect to the structure function $\lambda_x(\alpha)$ we can state that all optimal sets $S$, and only those, cause the function $\lambda_x$ to drop to its minimal possible value $K(x)$. We know that this happens for at least one set, $\{x\}$ of complexity $K(x) + O(1)$.



We are interested in finding optimal sets that have low complexity. Those having minimal complexity are called minimal optimal sets (and their programs minimal sufficient statistics). The less optimal the sets are, the more additional noise in the data they start to model; see the discussion of overfitting in the initial paragraphs of Section VI. To be rigorous we should say minimal among $c$-optimal. We know from TT? that the complexity of a minimal optimal set is at least $K(K(x))$, up to a fixed additive constant, for every $x$. So for smaller arguments the structure function definitely rises above the sufficiency line. We also know that for every $n$ there are so-called non-stochastic objects $x$ of length $n$ that have optimal sets of high complexity only. For example, there are $x$ of complexity $K(x \mid n^*) = n + O(1)$ such that every optimal set $S$ has also complexity $K(S \mid n^*) = n + O(1)$; hence by the conditional version $K(S \mid n^*) + \log |S| \le K(x \mid n^*) + c$ of (VI.2) we find that $|S|$ is bounded by a fixed universal constant. As $K(S \mid x^*) = O(1)$ (this is proven in the beginning of this section), for every $y \in S$ we have $K(y \mid x^*) \le K(y \mid S) + K(S \mid x^*) + O(1) = O(1)$. Roughly speaking, for such $x$ there is no other optimal set $S$ than the singleton $\{x\}$.

Example VI.2: Bernoulli Process: Let us look at the coin toss example of Item (iii) in Section V-B, this time in the sense of finite set models rather than probability models. Let $k$ be a number in the range $0, 1, \ldots, n$ of complexity $\log n + O(1)$ given $n$ and let $x$ be a string of length $n$ having $k$ ones of complexity $K(x \mid n, k) \ge \log \binom{n}{k}$ given $n, k$. This $x$ can be viewed as a typical result of tossing a coin with a bias about $p = k/n$. A two-part description of $x$ is given by the number $k$ of 1's in $x$ first, followed by the index $j \le \log |S|$ of $x$ in the set $S$ of strings of length $n$ with $k$ 1's. This set is optimal, since $K(x \mid n) = K(x, k \mid n) = K(k \mid n) + K(x \mid k, n) = K(S \mid n) + \log |S|$. ◊
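The two-part description in this example can be made concrete. The sketch below is one possible encoding, assuming lexicographic order on the strings of length $n$ with $k$ ones (the paper does not fix an order; the function names `rank`, `unrank`, and `two_part_code` are illustrative). The first part is $k$, the second part is the index of $x$, which is below $\binom{n}{k}$ and hence fits in about $\log \binom{n}{k}$ bits.

```python
from math import comb

def rank(x):
    """Index of bit string x among all strings of the same length and number of ones."""
    n, k, r = len(x), x.count("1"), 0
    for i, bit in enumerate(x):
        if bit == "1":
            # Every string with the same prefix and '0' at position i comes earlier.
            r += comb(n - i - 1, k)
            k -= 1
    return r

def unrank(n, k, r):
    """Inverse of rank: the r-th string of length n with exactly k ones."""
    bits = []
    for i in range(n):
        zero_here = comb(n - i - 1, k)   # strings continuing with '0' at position i
        if r < zero_here:
            bits.append("0")
        else:
            bits.append("1")
            r -= zero_here
            k -= 1
    return "".join(bits)

def two_part_code(x):
    """(model part, data-to-model part): the number of ones, then the index within the set."""
    return x.count("1"), rank(x)

x = "0110100"
k, j = two_part_code(x)
print(k, j, comb(len(x), k), unrank(len(x), k, j))
```

Decoding needs $n$ as side information, matching the conditioning on $n$ in the example.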

Example VI.3: Hierarchy of Sufficient Statistics:

Another possible application of the theory is to find a good summarization of the meaningful information in a given picture. All the information in the picture is described by a binary string $x$ of length $n = ml$ as follows. Chop $x$ into $l$ substrings $x_i$ ($1 \le i \le l$) of equal length $m$ each. Let $k_i$ denote the number of ones in $x_i$. Each such substring metaphorically represents a patch of, say, color. The intended color, say "cobalt blue", is indicated by the number of ones in the substring. The actual color depicted may be typical cobalt blue or less typical cobalt blue. The smaller the randomness deficiency of substring $x_i$ in the set of all strings of length $m$ containing precisely $k_i$ ones, the more typical $x_i$ is, and the better it achieves a typical cobalt blue color. The metaphorical "image" depicted by $x$ is $\pi(x)$, defined as the string $k_1 k_2 \ldots k_l$ over the alphabet $\{0, 1, \ldots, m\}$, the set of colors available. We can now consider several statistics for $x$.

Let $X \subseteq \{0, 1, \ldots, m\}^l$ (the set of possible realizations of the target image), and let $Y_i$ for $i = 0, 1, \ldots, m$ be a set of binary strings of length $m$ with $i$ ones (the set of realizations of target color $i$). Consider the set

$S = \{x' : \pi(x') \in X,\ (x')_i \in Y_{k_i} \text{ for all } i = 1, \ldots, l\}$

One possible application of these ideas is to gauge how good the picture is with respect to the given summarizing set $S$. Assume that $x \in S$. The set $S$ is then a statistic for $x$ that captures both the colors of the patches and the image, that is, the total picture. If $S$ is a sufficient statistic of $x$ then $S$ perfectly expresses the meaning aimed for by the image and the true color aimed for in every one of the color patches. Clearly, $S$ summarizes the relevant information in $x$ since it captures both image and coloring, that is, the total picture. But we can distinguish more sufficient statistics.

The set

$S_1 = \{x' : \pi(x') \in X\}$

is a statistic that captures only the image. It can be sufficient only if all colors used in the picture $x$ are typical. The set

$S_2 = \{x' : (x')_i \in Y_{k_i} \text{ for all } i = 1, \ldots, l\}$

is a statistic that captures the color information in the picture. It can be sufficient only if the image is a random string of length $l$ over the alphabet $\{0, 1, \ldots, m\}$, which is surely not the case for all the real images. Finally the set

$A_i = \{x' : (x')_i \in Y_{k_i}\}$

is a statistic that captures only the color of patch $(x')_i$ in the picture. It can be sufficient only if $K(i) \approx 0$ and all the other color applications and the image are typical.

C. Bumps in the Structure Function 

Consider $x \in \{0,1\}^n$ with $K(x \mid n) = n + O(1)$ and the conditional variant $h_x(\alpha \mid y) = \min_S \{\log |S| : S \ni x,\ |S| < \infty,\ K(S \mid y) \le \alpha\}$ of (II.8). Since $S_1 = \{0,1\}^n$ is a set containing $x$ and can be described by $O(1)$ bits (given $n$), we find $h_x(\alpha \mid n) \le n + O(1)$ for $\alpha = K(S_1 \mid n) = O(1)$. For increasing $\alpha$, the size of a set $S \ni x$ one can describe in $\alpha$ bits decreases monotonically until for some $\alpha_0$ we obtain a first set $S_0$ witnessing $h_x(\alpha_0 \mid n) + \alpha_0 = K(x \mid n) + O(1)$. Then, $S_0$ is a minimal-complexity optimal set for $x$, and $S_0^*$ is a minimal sufficient statistic for $x$. Further increase of $\alpha$ halves the set $S$ for each additional bit of $\alpha$ until $\alpha = K(x \mid n)$. In other words, for every increment $d$ we have $h_x(\alpha_0 + d \mid n) = K(x \mid n) - (\alpha_0 + d + O(\log d))$, provided the right-hand side is non-negative, and 0 otherwise. Namely, once we have an optimal set $S_0$ we can subdivide it in a standard way into $2^d$ parts and take as new set $S$ the part containing $x$. The $O(\log d)$ term is due to the fact that we have to consider self-delimiting encodings of $d$. This additive term is there to stay; it cannot be eliminated. For $\alpha \ge K(x \mid n)$ obviously the smallest set $S$ containing $x$ that one can describe using $\alpha$ bits (given $n$) is the singleton set $S = \{x\}$. The same analysis can be given for the unconditional version $h_x(\alpha)$ of the structure function, which behaves the same except for possibly the small initial part $\alpha \in [0, K(n))$ where the complexity is too small to specify the set $S_1 = \{0,1\}^n$; see the initial part of Section VI-A.
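The subdivision step can be sketched on explicit finite sets. The code below is an illustration only, assuming "a standard way" means cutting the sorted set into $2^d$ consecutive blocks (the paper does not specify the subdivision; `refine` and the example set are hypothetical). Each extra bit of description halves the model, and the block index is the $d$ bits of added model information.

```python
def refine(model, x, d):
    """Cut the sorted model into 2**d consecutive blocks; return the block holding x and its index."""
    items = sorted(model)
    parts = 2 ** d
    size = -(-len(items) // parts)     # ceiling division: block length
    j = items.index(x) // size         # which block x falls in (d bits of extra model information)
    return items[j * size:(j + 1) * size], j

S0 = {format(i, "06b") for i in range(64)}   # toy stand-in for an optimal set, log|S0| = 6
x = "101101"
block, j = refine(S0, x, d=2)
print(len(block), j, block[0], block[-1])
```

In the toy run the log-cardinality drops from 6 to 4 at the cost of the 2 bits that name the block; the $O(\log d)$ overhead for encoding $d$ itself in a self-delimiting way is not modeled here.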

The little bumps in the sufficient statistic region $[K(K(x)), K(x)]$ in Figure 3 are due to the boundedness of the number of sufficient statistics.

D. "Positive" and "Negative" Randomness 

(Continuing Section V-E) In [10] the existence of strings was shown for which essentially the singleton set consisting of the string itself is a minimal sufficient statistic. While a sufficient statistic of an object yields a two-part code that is as short as the shortest one-part code, restricting the complexity of the allowed statistic may yield two-part codes that are considerably longer than the best one-part code (so the statistic is insufficient). This is what happens for the non-stochastic objects. In fact, for every object there is a complexity bound below which this happens, but if that bound is small (logarithmic) we call the object "stochastic" since it has a simple satisfactory explanation (sufficient statistic). Thus, Kolmogorov in [12] (full text given in Section P) makes the important distinction of an object just being random in the "negative" sense by having high Kolmogorov complexity, and an object having high Kolmogorov complexity but also being random in the "positive, probabilistic" sense of having a low-complexity minimal sufficient statistic. An example of the latter is a string $x$ of length $n$ with $K(x) \ge n$, being typical for the set $\{0,1\}^n$, or the uniform probability distribution over that set, while this set or probability distribution has complexity $K(n) + O(1) = O(\log n)$. We depict the distinction in Figure ^

Corollary IV.9 establishes that for some constant $C$, for every length $n$, for every complexity $k \le n$ and every $\alpha_0 \in [0, k]$, there are $x$'s of length $n$ and complexity $k \pm C \log n$ such that the minimal randomness deficiency $\beta_x(i) \ge n - k - C \log n$ for every $i \le \alpha_0 - C \log n$ and $\beta_x(i) \le C \log n$ for every $i \ge \alpha_0 + C \log n$. Fix $\varepsilon = C \log n$ and define for all $s, t = 0, \ldots, n/(2\varepsilon) - 1$ the set $A_{st}$ of all $n$-length strings of complexity $K(x) \in [(2s - 1)\varepsilon, (2s + 1)\varepsilon)$ and such that the minimal randomness deficiency $\beta_x(i) \ge n - (2s + 1)\varepsilon$ for every $i \le (2t - 1)\varepsilon$ and $\beta_x(i) \le \varepsilon$ for every $i \ge (2t + 1)\varepsilon$. Corollary IV.9 implies that every $A_{st}$ is non-empty (let $\alpha_0 = 2t\varepsilon$, $k = 2s\varepsilon$). Note that the $A_{st}$ are pairwise disjoint. Indeed, if $s \ne s'$ then $A_{st}$ and $A_{s't'}$ are disjoint as the corresponding strings $x, x'$ have different complexities. And if $t \ne t'$, say $t < t'$, then $A_{st}$ and $A_{s't'}$ are disjoint, as the corresponding strings $x, x'$ have different values of the deficiency function in the point $i = (2t + 1)\varepsilon$: $\beta_{x'}((2t + 1)\varepsilon) \ge n - (2s + 1)\varepsilon > \varepsilon \ge \beta_x((2t + 1)\varepsilon)$.

Letting $k = \alpha_0 = n - \sqrt{n}$ we see that there are $n$-length non-stochastic strings of almost maximal complexity $n - \sqrt{n} \pm O(\log n)$ having significant $\sqrt{n} \pm O(\log n)$ randomness deficiency with respect to $\{0,1\}^n$ or, in fact, every other finite set of complexity less than $n - O(\log n)$!

VII. Computability Questions

How difficult is it to compute the functions $h_x$, $\lambda_x$, $\beta_x$, and the minimal sufficient statistic? To express the properties appropriately we require the notion of functions that are not computable, but can be approximated monotonically by a computable function.

Definition VII.1: A function $f : \mathcal{N} \to \mathcal{R}$ is upper semi-computable if there is a Turing machine $T$ computing a total function $\phi$ such that $\phi(x, t + 1) \le \phi(x, t)$ and $\lim_{t \to \infty} \phi(x, t) = f(x)$. This means that $f$ can be computably approximated from above. If $-f$ is upper semi-computable, then $f$ is lower semi-computable. A function is called semi-computable if it is either upper semi-computable or lower semi-computable. If $f$ is both upper semi-computable and lower semi-computable, then we call $f$ computable (or recursive if the domain is integer or rational).
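The monotone shape $\phi(x, t + 1) \le \phi(x, t)$ of an approximation from above can be shown with a deliberately simple stand-in. The sketch below is not an approximation of $K(x)$: it assumes a fixed, finite list of standard-library compressors as the only "description methods" tried, and the one-byte tag is a hypothetical cost for naming the method. Stage $t$ reports the best bound found so far, so the sequence never increases.

```python
import bz2
import lzma
import zlib

def description_length_bounds(data):
    """Yield phi(x, 0), phi(x, 1), ...: a non-increasing sequence of upper bounds in bits."""
    best = 8 * len(data) + 8          # literal encoding: the data itself plus a one-byte tag
    yield best
    for compress in (zlib.compress, bz2.compress, lzma.compress):
        best = min(best, 8 * len(compress(data)) + 8)   # keep the running minimum
        yield best

bounds = list(description_length_bounds(b"01" * 500))
print(bounds)
```

Here the process stops after three methods and the final value is known to be final. In a genuine upper semi-computation the stages never end, and no stage can be certified as having reached the limit.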

Semi-computability gives no speed-of-convergence guarantees: even though the limit value is monotonically approximated, we know at no stage in the process how close we are to the limit value. The functions $h_x(\alpha)$, $\lambda_x(\alpha)$, $\beta_x(\alpha)$ have finite domain for given $x$ and hence can be given as a table, so formally speaking they are computable. But this evades the issue: there is no algorithm that computes these functions for given $x$ and $\alpha$. Considering them as two-argument functions we show the following (we actually quantify these):

• The functions $h_x(\alpha)$ and $\lambda_x(\alpha)$ are upper semi-computable but they are not computable up to any reasonable precision.

• Moreover, there is no algorithm that given $x^*$ and $\alpha$ finds $h_x(\alpha)$ or $\lambda_x(\alpha)$.

• The function $\beta_x(\alpha)$ is not upper or lower semi-computable, not even to any reasonable precision, but we can compute it given an oracle for the halting problem.

• There is no algorithm that given $x$ and $K(x)$ finds a minimal sufficient statistic for $x$ up to any reasonable precision.

Intuition: the functions $h_x$ and $\lambda_x$ (the ML-estimator and the MDL-estimator, respectively) can be monotonically approximated in the upper semi-computable sense. But the fitness function $\beta_x$ cannot be monotonically approximated in that sense, nor in the lower semi-computable sense, in both cases not even up to any relevant precision.

The precise forms of these quite strong noncomputability 
and nonapproximability results are given in Appendix IdI 

VIII. Realizing the Structure Function 

It is straightforward that we can monotonically approximate $h_x$ and its witnesses (similarly $\lambda_x$) in the sense that there exists a non-halting algorithm $A$ that given any $x, \alpha$ outputs a finite sequence $p_1, p_2, p_3, \ldots, p_l$ of pairwise different computer programs each of length at most $\alpha + C \log |x|$ ($C$ is a constant) such that each program $p_i$ prints a model $S_i$ such that $|S_1| > |S_2| > \cdots > |S_l|$. This way of computing $h_x$ or $\lambda_x$ is called upper semi-computable, formally defined in Definition VII.1. By the results of Section IV the last model $S_l$ is "near" the best possible model according to the randomness deficiency criterion: There is no program $p$ of length at most $\alpha$ that prints a model $S$ such that the randomness deficiency of $x$ for $S$ is $C \log |x|$ less than that of $x$ for $S_l$. Note that we are not able to identify $p_l$ given $x, \alpha$, since the algorithm $A$ is non-halting and thus we do not know which program will be output last. This way we obtain a model of (approximately) best fit at each complexity level $\alpha$, but non-uniformly.

The question arises whether there is a uniform construction to obtain the models that realize the structure functions at given complexities. Here we present such a construction. (In view of the non-computability of structure functions, Section VII, the construction is of course not computable.)

We give a general uniform construction of the finite sets witnessing $\lambda_x$, $h_x$, and $\beta_x$, at each argument (that is, level of model complexity), in terms of indexes of $x$ in the enumeration of strings of given complexity, up to the "coarse" equivalence precision of Section IV. This extends a technique introduced in [10].

Definition VIII.1: Let $N^l$ denote the number of strings of complexity at most $l$, and let $|N^l|$ denote the length of the binary notation of $N^l$. For $i \le |N^l|$ let $N^l_i$ stand for the $i$ most significant bits of the binary notation of $N^l$. Let $D$ denote the set of all pairs $\{\langle x, l \rangle \mid K(x) \le l\}$. Fix an enumeration of $D$ and denote by $I^l_x$ the minimum index of a pair $\langle x, i \rangle$ with $i \le l$ in that enumeration, that is, the number of pairs enumerated before $\langle x, i \rangle$ (if $K(x) > l$ then $I^l_x = \infty$). Let $m^l_x$ denote the maximal common prefix of the binary notations of $I^l_x$ and $N^l$, that is, $I^l_x = m^l_x 0 * \cdots *$ and $N^l = m^l_x 1 * \cdots *$ (we assume here that the binary notation of $I^l_x$ is written in exactly $|N^l|$ bits with leading zeros if necessary).

(In ^U] the notation is used for mf^ with I = K{x).) 

Theorem VIII.2: For every $i \le l$, the number $N^l_i$ is algorithmically equivalent to $N^i$, that is, $K(N^i \mid N^l_i), K(N^l_i \mid N^i) = O(\log l)$.

Before proceeding to the main theorem of this section we 
introduce some more notation. 

Definition VIII.3: For $i \le l$ let $S^l_i$ denote the set of all strings $y$ such that the binary notation of $I^l_y$ has the form $N^l_i 0 * \cdots *$ (we assume here that binary notations of indexes are written using exactly $|N^l|$ bits).

Let $c$ denote a constant such that $K(x) \le \Lambda(S) + c$ for every $x \in S$. The following theorem shows that the sets $S^l_i$ form a universal family of statistics for $x$.

Theorem VIII.4: (i) If the $(i + 1)$st most significant bit of $N^l$ is 1 then $|S^l_i| = 2^{|N^l| - i - 1}$ and $S^l_i$ is algorithmically equivalent to $N^l_i$, that is, $K(N^l_i \mid S^l_i), K(S^l_i \mid N^l_i) = O(\log l)$.

(ii) For every $S$ and every $x \in S$, let $l = \Lambda(S) + c$ and $i = |m^l_x|$. Then $x \in S^l_i$, $K(S^l_i \mid S) = O(\log l)$, $K(S^l_i) = i + O(\log l) \le K(S) + O(\log l)$, and $\Lambda(S^l_i) \le \Lambda(S) + O(\log l)$ (that is, $S^l_i$ is not worse than $S$, as a model explaining $x$).

(iii) If $\alpha$ is critical then every $S$ witnessing $\lambda_x(\alpha)$ is algorithmically equivalent to $N^\alpha$. That is, if $K(S) \approx \alpha$ and $\Lambda(S) \approx \lambda_x(\alpha)$ but $K(N^\alpha \mid S) \gg 0$ or $K(S \mid N^\alpha) \gg 0$ then there is $A \ni x$ with $K(A) \ll \alpha$ and $\Lambda(A) \approx \lambda_x(\alpha)$. More specifically, for all $S \ni x$ either $K(S \mid N^\alpha) \le K(S) - \alpha$ and $K(N^\alpha \mid S) = 0$, or there is $A \ni x$ such that $\Lambda(A) \le \Lambda(S)$ and $K(A) \le \min\{\alpha - K(N^\alpha \mid S), K(S) - K(S \mid N^\alpha)\}$, where all inequalities hold up to an $O(\log \Lambda(S))$ additive term.
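The cardinality claim in Item (i) is purely combinatorial and can be checked on a toy enumeration. The sketch below assumes the enumerated objects are simply identified with their indexes $0, \ldots, N - 1$, where `N` plays the role of $N^l$ (the real enumeration is of strings of bounded complexity and is not computable; `index_prefix_sets` is an illustrative name). For each $i$ it collects the indexes whose $|N|$-bit binary notation starts with the first $i$ bits of $N$ followed by a 0.

```python
def index_prefix_sets(N):
    """Map i -> indexes y < N whose |N|-bit notation begins with (first i bits of N) + '0'."""
    width = N.bit_length()
    n_bits = format(N, "b")
    sets = {}
    for i in range(width):
        prefix = n_bits[:i] + "0"
        sets[i] = [y for y in range(N) if format(y, f"0{width}b").startswith(prefix)]
    return sets

N = 0b10110                      # toy value standing in for N^l
width = N.bit_length()
sets = index_prefix_sets(N)
for i in range(width):
    bit = format(N, "b")[i]      # the (i+1)st most significant bit of N
    print(i, bit, len(sets[i]), 2 ** (width - i - 1))
```

Whenever the $(i + 1)$st bit of `N` is 1, every completion of the prefix is below `N`, so the set has exactly $2^{|N| - i - 1}$ elements. In this toy setting those sets also partition the indexes $0, \ldots, N - 1$.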

Note that Item (iii) of the theorem does not hold for non-critical points. For instance, for a random string $x$ of length $n$ there are independent $S_1, S_2$ witnessing $\lambda_x(\frac{n}{2}) = n$: let $S_1$ be the set of all $x'$ of length $n$ having the same prefix of length $\frac{n}{2}$ as $x$ and $S_2$ be the set of all $x'$ of length $n$ having the same suffix of length $\frac{n}{2}$ as $x$.

Corollary VIII.5: Let $x$ be a string of length $n$ and complexity $k$. For every $\alpha$ ($K(n) + O(1) \le \alpha \le k$) there is $l \le n + K(n) + O(1)$ such that the set $S^l_\alpha$ both contains $x$ and witnesses $h_x(\alpha)$, $\lambda_x(\alpha)$, and $\beta_x(\alpha)$, up to an $O(\log n)$ additive term in the argument and value.



Appendix 
I. Oral History 

Since there is no written version of Kolmogorov's initial 
proposal [TBI J [21 J which we argued is a new approach to 
a "non-probabilistic statistics," apart from a few lines [TC] 
which we reproduced in Section J] we have to rely on the 
testimony of witnesses 0], 231- Says Tom Cover 
"I remember taking many long hours trying to understand 
the motivation of Kolmogorov's approach." According to 
Peter Gacs, "Kolmogorov drew a picture of $h_x(\alpha)$ as
a function of $\alpha$ monotonically approaching the diagonal
[sufficiency line]. Kolmogorov stated that it was known
(proved by L.A. Levin) that in some cases it remained far
from the diagonal line till the very end." Leonid A. Levin
[13]: "Kolmogorov told me [about] $h_x(i)$ (or its inverse,
I am not sure) and asked how this $h(i)$ could behave. I
proved that $i + h(i) + O(\log i)$ is monotone but otherwise
arbitrary within $O(\sqrt{i})$ accuracy; it stabilizes on $K(x)$ when $i$
exceeds $I(x : \text{Halting})$. (Actually, this expression for accuracy
was Kolmogorov's re-wording, I gave it in less elegant
but equivalent terms: $O(p \log i)$ where $p$ is the number of
"jumps".) I do not remember Kolmogorov defining $\beta_x(i)$
or suggesting anything like your result. I never published
anything on the topic because I do not believe strings $x$
with significant $I(x : \text{Halting})$ could exist in the world."
($I(x : y) = K(y) - K(y \mid x)$ is the information in $x$ about
$y$. By (II.3) we have $I(x : y) = I(y : x)$, with equality
holding up to a constant additive term independent of $x$
and $y$, and hence we call this quantity the algorithmic mutual
information. Above, "Halting" stands for the infinite
binary "halting sequence" defined as follows: The $i$th bit of
Halting is 1 iff the $i$th program for the reference universal
prefix machine $U$ halts, and 0 otherwise.)

Remark A.1: Levin's statement [13] quoted above appears
to suggest that strings $x$ such that $h_x(i) + i$ stabilizes
on $K(x)$ only for large $i$ may exist mathematically
but are unlikely to occur in nature, because such $x$'s must
have a lot of information about the Halting problem, and
hence the analysis of their properties is irrelevant. But the
statement in question is imprecise. There are two ways to
understand the statement: (i) $h_x(i) + i$ stabilizes on $K(x)$
when $i$ exceeds $I(x : \text{Halting})$ or earlier; or (ii) $h_x(i) + i$
stabilizes on $K(x)$ when $i$ exceeds $I(x : \text{Halting})$ and not
earlier. It is not clear what "the information in $x$ about the
halting problem" is, since the "Halting problem" is not a
finite object and thus the notion of information about Halting
needs a special definition. The usual
$I(x : \text{Halting}) = K(\text{Halting}) - K(\text{Halting} \mid x)$
does not make sense since both $K(\text{Halting})$ and
$K(\text{Halting} \mid x)$ are infinite. The expression
$I(x : \text{Halting}) = K(x) - K(x \mid \text{Halting})$ looks better
provided $K(x \mid \text{Halting})$ is understood as $K(x)$ relativized
by the Halting problem. In the latter interpretation of
$I(x : \text{Halting})$, case (i) is correct and case (ii) is false. The
correctness of (i) is implicit in Theorem V.4. A counterexample
to (ii): Let $p$ be the halting program of length at
most $n$ with the greatest running time. It is easy to show
that $K(p)$ is about $n$, and therefore $p$ is a random string of
length about $n$. As a consequence, the complexity of the
minimal sufficient statistic of $p$ is close to 0. On the
other hand $I(p : \text{Halting})$ is about $n$. Indeed, given the
oracle for the Halting problem and $n$ we can find $p$; hence
$I(p : \text{Halting}) = K(p) - K(p \mid \text{Halting}) \ge n - K(n) \ge n - 2\log n$.

II. Validity for Extended Models 

Following Kolmogorov we analyzed a canonical setting 
where the models are finite sets. As Kolmogorov himself 
pointed out, this is no real restriction: the finite sets model 
class is equivalent, up to a logarithmic additive term, to 
the model class of probability density functions, as stud- 
ied in 1221, COI- The analysis is valid, up to logarithmic 
additive terms, also for the model class of total recursive 
functions, as studied in [25]. The model class of computable
probability density functions consists of the set of
functions $P : \{0,1\}^* \to [0,1]$ with $\sum_x P(x) = 1$. "Computable"
means here that there is a Turing machine $T_P$
that, given $x$ and a positive rational $\epsilon$, computes $P(x)$
with precision $\epsilon$. The (prefix-) complexity $K(P)$ of a
computable (possibly partial) function $P$ is defined by
$K(P) = \min_i \{K(i) : \text{Turing machine } T_i \text{ computes } P\}$.
A string $x$ is typical for a distribution $P$ if the randomness
deficiency $\delta(x \mid P) = -\log P(x) - K(x \mid P)$ is small. The
conditional complexity $K(x \mid P)$ is defined as follows. Say
that a function $A$ approximates $P$ if $|A(y, \epsilon) - P(y)| < \epsilon$ for
every $y$ and every positive rational $\epsilon$. Then $K(x \mid P)$ is the
minimum length of a program that, given every function $A$
approximating $P$ as an oracle, prints $x$. Similarly, $P$ is
$c$-optimal for $x$ if $K(P) - \log P(x) \le K(x) + c$. Thus, instead
of the data-to-model code length $\log |S|$ for finite set models,
we consider the data-to-model code length $-\log P(x)$
(the Shannon-Fano code). The value $-\log P(x)$ also measures
how likely $x$ is under the hypothesis $P$, and the mapping
$x \mapsto P_{\min}$, where $P_{\min}$ minimizes $-\log P(x)$ over $P$
with $K(P) \le \alpha$, is a constrained maximum likelihood
estimator; see Figure 5. Our results thus imply that such a
constrained maximum likelihood estimator always returns
a hypothesis with minimum randomness deficiency.

The essence of this approach is that we mean maximization
over a class of likelihoods induced by computable probability
density functions that are below a certain complexity
level $\alpha$. In classical statistics, unconstrained maximum
likelihood is known to perform badly for model selection,
because it tends to select the most complex models possible.
This is closely reflected in our approach: unconstrained
maximization will result in the computable probability
distribution of complexity about $K(x)$ that concentrates all
probability on $x$. But the structure function $h_x(\alpha)$ tells us
all stochastic properties of data $x$ in the sense explained
in detail at the start of Section IV for finite set models.

The model class of total recursive functions consists
of the set of computable functions $p : \{0,1\}^* \to \{0,1\}^*$.
The (prefix-) complexity $K(p)$ of a total recursive
function $p$ is defined by
$K(p) = \min_i \{K(i) : \text{Turing machine } T_i \text{ computes } p\}$.
In place of $\log |S|$ for finite set models we consider the
data-to-model code length $l_x(p) = \min\{|d| : p(d) = x\}$.
A string $x$ is typical for a total recursive function $p$ if the
randomness deficiency $\delta(x \mid p) = l_x(p) - K(x \mid p)$ is small.
The conditional complexity $K(x \mid p)$ is defined as the minimum
length of a program that, given $p$ as an oracle, prints $x$.
Similarly, $p$ is $c$-optimal for $x$ if $K(p) + l_x(p) \le K(x) + c$.

Fig. 5. Structure function
$h_x(\alpha) = \min_P \{-\log P(x) : P(x) > 0, K(P) \le \alpha\}$ with $P$ a
computable probability density function, with values
according to the left vertical coordinate, and the maximum
likelihood estimator
$2^{-h_x(\alpha)} = \max\{P(x) : P(x) > 0, K(P) \le \alpha\}$, with values
according to the right-hand side vertical coordinate.
(Horizontal axis: $\alpha$, with the minimal sufficient statistic and $K(x)$ marked.)

It is easy to show that for every data string x and a 
contemplated finite set model for it, there is an almost 
equivalent computable probability density function model 
and an almost equivalent total recursive function model. 

Proposition B.1: For every $x$ and every finite set $S \ni x$
there is:

(a) A computable probability density function $P$ with
$-\log P(x) = \log |S|$, $\delta(x \mid P) = \delta(x \mid S) + O(1)$ and
$K(P) = K(S) + O(1)$; and

(b) A total recursive function $p$ such that $l_x(p) \le \log |S|$,
$\delta(x \mid p) \le \delta(x \mid S) + O(1)$ and $K(p) = K(S) + O(1)$.

Proof: (a) Define $P(y) = 1/|S|$ for $y \in S$ and $P(y) = 0$
otherwise.

(b) If $S = \{x_0, \ldots, x_{m-1}\}$, then define $p(d) = x_{d \bmod m}$.

■
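
The two constructions in this proof are simple enough to write down directly. The sketch below is illustrative only: the names `uniform_density`, `index_function` and `code_length` are hypothetical, the ordering $x_0, \ldots, x_{m-1}$ is assumed to be lexicographic, the argument $d$ is read as a binary numeral (with the empty string read as 0), and `code_length` finds $l_x(p)$ by brute force up to a caller-supplied length bound.

```python
from fractions import Fraction
from itertools import product
from math import log2

def uniform_density(S):
    """Item (a): P(y) = 1/|S| on S and 0 elsewhere."""
    members = frozenset(S)
    def P(y):
        return Fraction(1, len(members)) if y in members else Fraction(0)
    return P

def index_function(S):
    """Item (b): p(d) = x_{d mod m}, with S listed in lexicographic order."""
    elems = sorted(S)
    m = len(elems)
    def p(d):
        value = int(d, 2) if d else 0   # empty string is read as 0 (assumption)
        return elems[value % m]
    return p

def code_length(p, x, max_len):
    """Brute-force l_x(p) = min{|d| : p(d) = x}, or None if above max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            if p("".join(bits)) == x:
                return n
    return None
```

For $S = \{000, 011, 101, 110\}$ the density assigns $1/4$ to each member, so $-\log P(x) = 2 = \log|S|$, and every member has a description $d$ of at most 2 bits. When $|S|$ is not a power of two the bound on $l_x(p)$ holds only up to rounding.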

The converse of Proposition B.1 is slightly harder: for
every data string $x$ and a contemplated computable probability
density function model for it, as well as for a contemplated
total recursive function model for $x$, there is a finite
set model for $x$ that has no worse complexity, randomness
deficiency, and worst-case data-to-model code for $x$, up to
additive logarithmic precision.

Proposition B.2: There are constants $c, C$ such that for
every string $x$, the following holds:

(a) For every computable probability density function
$P$ there is a finite set $S \ni x$ such that
$\log |S| < -\log P(x) + 1$,
$\delta(x \mid S) \le \delta(x \mid P) + 2\log K(P) + K(\lfloor -\log P(x) \rfloor) + 2\log K(\lfloor -\log P(x) \rfloor) + C$
and $K(S) \le K(P) + K(\lfloor -\log P(x) \rfloor) + C$; and

(b) For every total recursive function $p$ there is a finite set
$S \ni x$ with $\log |S| \le l_x(p)$,
$\delta(x \mid S) \le \delta(x \mid p) + 2\log K(p) + K(l_x(p)) + 2\log K(l_x(p)) + c$
and $K(S) \le K(p) + K(l_x(p)) + c$.

Proof: (a) Let $m = \lfloor -\log P(x) \rfloor$, that is,
$2^{-m-1} < P(x) \le 2^{-m}$. Define $S = \{y : P(y) > 2^{-m-1}\}$. Then
$|S| < 2^{m+1} \le 2/P(x)$, which implies the claimed value
for $\log |S|$. To list $S$ it suffices to compute all consecutive
values of $P(y)$ to sufficient precision until the combined
probabilities exceed $1 - 2^{-m-1}$. That is,
$K(S) \le K(P) + K(m) + O(1)$. Finally,
$\delta(x \mid S) = \log |S| - K(x \mid S^*) < -\log P(x) - K(x \mid S^*) + 1
= \delta(x \mid P) + K(x \mid P) - K(x \mid S^*) + 1 \le \delta(x \mid P) + K(S^* \mid P) + O(1)$.
The term $K(S^* \mid P)$ can be upper bounded as
$K(K(S)) + K(m) + O(1) \le 2\log K(S) + K(m) + O(1)
\le 2\log(K(P) + K(m)) + K(m) + O(1)
\le 2\log K(P) + 2\log K(m) + K(m) + O(1)$, which
implies the claimed bound for $\delta(x \mid S)$.

(b) Define $S = \{y : p(d) = y, |d| = l_x(p)\}$. Then
$\log |S| \le l_x(p)$. To list $S$ it suffices to compute $p(d)$ for
every argument of length equal to $l_x(p)$. Hence
$K(S) \le K(p) + K(l_x(p)) + O(1)$. The upper bound for $\delta(x \mid S)$
is derived in the same way as in the proof of item (a).

■
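
The level-set construction of item (a) can be tried on a small distribution. This is a sketch under simplifying assumptions that are not in the proof: $P$ has finite support and is given as a Python dictionary, and the probabilities are exact powers of two so that floating-point logarithms are exact. The name `level_set` is hypothetical.

```python
from math import floor, log2

def level_set(P, x):
    """Return (m, S) with m = floor(-log2 P(x)) and S = {y : P(y) > 2^(-m-1)}."""
    m = floor(-log2(P[x]))
    threshold = 2.0 ** (-m - 1)
    S = {y for y, prob in P.items() if prob > threshold}
    return m, S

# Illustrative distribution with dyadic probabilities.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
m_b, S_b = level_set(P, "b")
m_c, S_c = level_set(P, "c")
```

In both cases $x \in S$ and $\log|S| < -\log P(x) + 1$, as the proposition states. The proof handles a general computable $P$ by approximating values until almost all probability mass is accounted for; the dictionary lookup here sidesteps that step.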

Remark B.3: How large are the nonconstant additive
complexity terms in Proposition B.2 for strings $x$ of length
$n$? In item (a), we are commonly only interested in
$P$ such that $K(P) \le n + O(\log n)$ and $-\log P(x) \le n + O(1)$.
Indeed, for every $P$ there is $P'$ such that
$K(P') \le \min\{K(P), n\} + O(\log n)$,
$\delta(x \mid P') \le \delta(x \mid P) + O(\log n)$,
$-\log P'(x) \le \min\{-\log P(x), n\} + 1$. Such
$P'$ is defined as follows: If $K(P) > n$ then $P'(x) = 1$ and
$P'(y) = 0$ for every $y \ne x$; otherwise $P' = (P + U_n)/2$
where $U_n$ stands for the uniform distribution on $\{0,1\}^n$.
Then the additive terms in item (a) are $O(\log n)$. In
item (b) we are commonly only interested in $p$ such that
$K(p) \le n + O(\log n)$ and $l_x(p) \le n + O(1)$. Indeed, for every
$p$ there is $p'$ such that $K(p') \le \min\{K(p), n\} + O(\log n)$,
$\delta(x \mid p') \le \delta(x \mid p) + O(\log n)$, $l_x(p') \le \min\{l_x(p), n\} + 1$.
Such $p'$ is defined as follows: If $K(p) > n$ then $p'$ maps all
strings to $x$; otherwise $p'(0u) = p(u)$ and $p'(1u) = u$. Then
the additive terms in item (b) are $O(\log n)$. Thus, in this
sense all results in this paper that hold for finite set models
extend, up to a logarithmic additive term, to computable
probability density function models and to total recursive
function models. Since the results in this paper hold only
up to an additive logarithmic term anyway, this means that all
of them equivalently hold for the model class of computable
probability density functions, as well as for the model class
of total recursive functions. ◊
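
The mixture $P' = (P + U_n)/2$ used in this remark can be checked numerically on a toy case. The sketch assumes $P$ is supplied as a Python callable returning floats; `mix_with_uniform` and the example distribution are hypothetical and serve only to illustrate the bound $-\log P'(x) \le \min\{-\log P(x), n\} + 1$.

```python
from math import log2

def mix_with_uniform(P, n):
    """Return P' = (P + U_n)/2, where U_n is uniform on binary strings of length n."""
    uniform_mass = 2.0 ** (-n)
    def P_prime(y):
        u = uniform_mass if len(y) == n else 0.0
        return (P(y) + u) / 2
    return P_prime

# Illustrative P: nearly all mass on "1111", a sliver on "0101".
n = 4
table = {"1111": 1 - 2.0 ** -10, "0101": 2.0 ** -10}
P = lambda y: table.get(y, 0.0)
P_prime = mix_with_uniform(P, n)
```

For $x = 0101$ the original code length is 10 bits while the mixture gives just under 5, which is within the stated bound $\min\{10, 4\} + 1$. Mixing costs at most one extra bit where $P$ was already good and caps the code length near $n$ where it was poor.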

III. Proofs 

Proof: Lemma IV.2. The inequality $\lambda_x(\alpha) \le h_x(\alpha) + \alpha$
is immediate. So it suffices to prove that
$h_x(\alpha) + \alpha \le \lambda_x(\alpha) + K(\alpha) + O(1)$. The proof of this
inequality is based on the following:

Claim C.1: Ignoring additive $K(i)$ terms the function
$h_x(i) + i$ does not increase:

$h_x(i_2) + i_2 \le h_x(i_1) + i_1 + K(i_2 \mid i_1) + O(1)$   (C.1)

for $i_1 \le i_2 \le K(x)$.

Proof: Let $S$ be a finite set containing $x$ with $K(S) \le i_1$
and $\log |S| = h_x(i_1)$. For every $m \le \log |S|$, we can
partition $S$ into $2^m$ equal-size parts and select the part $S'$
containing $x$. Then $\log |S'| = \log |S| - m$, at the cost of
increasing the complexity of $S'$ to

$K(S') \le K(S) + m + K(m \mid K(S)) + O(1)$

(we specify the part $S'$ containing $x$ by its index among all
the parts). Choose

$m = i_2 - K(S) - K(i_2 \mid K(S)) - c$

for a constant $c$ to be determined later. Note that

$K(m \mid K(S)) \le K(i_2, K(i_2 \mid K(S)) \mid K(S)) + K(c) + c'$
$\le K(i_2 \mid K(S)) + K(c) + c''$

for appropriate constants $c', c''$. The complexity of the
resulting set $S'$ is thus at most

$K(S) + i_2 - K(S) - K(i_2 \mid K(S)) - c + K(i_2 \mid K(S)) + K(c) + c'' \le i_2$,

provided $c$ is chosen large enough. Hence,
$h_x(i_2) \le \log |S'| = h_x(i_1) - m = h_x(i_1) - i_2 + K(S) + K(i_2 \mid K(S)) + c$,
and it suffices to prove that
$K(S) + K(i_2 \mid K(S)) \le i_1 + K(i_2 \mid i_1) + O(1)$.
This follows from the bound
$K(i_2 \mid K(S)) \le K(i_2 \mid i_1) + K(i_1 \mid K(S)) + O(1)
\le K(i_2 \mid i_1) + K(i_1 - K(S)) + O(1)
\le K(i_2 \mid i_1) + i_1 - K(S) + O(1)$.

■
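
The partitioning step in the proof of Claim C.1 trades $m$ bits of model complexity (the index of the chosen part) for $m$ bits of log-cardinality. A minimal sketch, assuming the set is ordered lexicographically and its size is divisible by $2^m$; the function name `refine` is hypothetical:

```python
def refine(S, x, m):
    """Split S into 2**m equal parts (lexicographic order); return (index, part) for x."""
    elems = sorted(S)
    parts = 2 ** m
    if len(elems) % parts != 0:
        raise ValueError("|S| must be divisible by 2**m in this sketch")
    size = len(elems) // parts
    index = elems.index(x) // size          # m bits suffice to write this index
    return index, elems[index * size:(index + 1) * size]

# Illustrative use: all 4-bit strings, refined by m = 2 bits around x = "1011".
universe = [format(v, "04b") for v in range(16)]
index, part = refine(universe, "1011", 2)
```

Here $\log|S'| = \log|S| - m = 2$, and the part is identified by the pair consisting of $S$ and a 2-bit index, mirroring the bound $K(S') \le K(S) + m + K(m \mid K(S)) + O(1)$.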

Let $S$ witness $\lambda_x(\alpha)$. Substituting $K(S) = i_1$, $\alpha = i_2$
in (C.1) we obtain:
$h_x(\alpha) + \alpha \le h_x(K(S)) + K(S) + K(\alpha \mid K(S)) + O(1)
\le \Lambda(S) + K(\alpha) + O(1) = \lambda_x(\alpha) + K(\alpha) + O(1)$.

■

Proof: Theorem IV.4. (i) We first observe that for
every $x$ of length $n$ we have $\lambda_x(K(n) + O(1)) \le n + K(n) + O(1)$,
as witnessed by $S = \{0,1\}^n$. At the other extreme,
$\lambda_x(k + O(1)) = k + O(1)$, as witnessed by $S = \{x\}$.

Define X{i) by the equation X{i) — k — max{0, Aa;(z -I- 
K{n) + 0(1)) - fc - 0(1)}. Then A:^ = ^(A) with 5 ^ 
K{n) + 0(1), and A satisfies the requirements of Item (i) 
of the theorem. 

(ii) Fix $\lambda(i)$ satisfying the conditions in the theorem. It
suffices to show that there is a string $x$ of length $n$ such
that, for every $i \in [0, k]$, we have $\lambda_x(i) \ge \lambda(i)$ and
$\lambda_x(i + \delta(i)) \le \lambda(i) + \delta(i)$ for $\delta(i) = K(i, n, \lambda) + O(1)$.
Then, with $\delta = \delta(k)$, we have
$K(x) \le \lambda_x(k + \delta) + O(1) \le \lambda(k) + \delta + O(1) = k + \delta + O(1)$.
And the inequality $\lambda_x(k) \ge \lambda(k) = k$
implies that $K(x) \ge k - O(1)$.

Claim C.2: For every length $n$, there is a string $x$ of
length $n$ such that $\lambda_x(i) \ge \lambda(i)$ for every $i$ in the domain
of $\lambda$.
Proof: Fix a length $n$. If $\lambda_x(i) < \lambda(i)$ then $x$ belongs
to a set $A$ with $\Lambda(A) < \lambda(i) \le \lambda(0) \le n$. The total number
of elements in different such $A$'s is less than
$\sum_A 2^{n - K(A)} = 2^n \sum_A 2^{-K(A)} \le 2^n$, where the second
inequality follows by (II.2). ■

We prove Item (ii) by demonstrating that the lexicographically
first $x$, as defined in Claim C.2, also satisfies
$\lambda_x(i + \delta(i)) \le \lambda(i) + \delta(i)$, for $\delta(i) = K(i, n, \lambda) + O(1)$, for
all $i \in [0, k]$. It suffices to construct a set $S \ni x$ of cardinality
$2^{\lambda(i) - i}$ and of complexity at most $i + \delta(i)$, for every
$i \in [0, k]$.

For every fixed $i \in [0, k]$ we can run the following:

Algorithm: Let $A$ be a set variable initially containing
all strings of length $n$, and let $S$ be a set variable initially
containing the $2^{\lambda(i) - i}$ first strings of $A$ in lexicographical
order. Run all programs of length at most $n$ dovetail style.
Every time a program $p$ of some length $j$ halts, $\lambda(j)$ is
defined, and $p$ prints a set $B$ of cardinality at most $2^{\lambda(j) - j}$, we
remove all the elements of $B$ from $A$ (but not from $S$); we
call a step at which this happens a $j$-step. Every time $S \cap A$
becomes empty at a $j$-step, we replace the contents of $S$ by
the set of the $2^{\lambda(i) - i}$ first strings in lexicographical order of
(the current contents of) $A$. Possibly, the last replacement
of $S$ is incomplete because there are fewer than $2^{\lambda(i) - i}$
elements left in $A$. It is easy to see that $x \in S \cap A$ just after
the final replacement, and stays there forever after, even
though some programs in the dovetailing process may still
be running and elements from $A$ may still be eliminated.
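
The bookkeeping of this algorithm (as distinct from the dovetailed program execution, which cannot be reproduced in a short example) can be simulated on a toy universe. In the sketch below the halting programs are replaced by an explicit, hypothetical list of removal events, each event being the set $B$ printed at one $j$-step; `run_replacements` is an illustrative name, not part of the paper.

```python
def run_replacements(universe, size, removal_events):
    """Simulate the S/A bookkeeping: refill S whenever S ∩ A becomes empty."""
    order = sorted(universe)            # lexicographic order on the universe
    alive = set(order)                  # the set variable A
    S = order[:size]                    # first `size` strings of A
    replacements = 0
    for B in removal_events:            # one event per j-step
        alive -= set(B)                 # remove B from A, but not from S
        if alive and not any(s in alive for s in S):
            S = [y for y in order if y in alive][:size]
            replacements += 1
    return S, replacements

universe = [format(v, "03b") for v in range(8)]
events = [{"000"}, {"001"}, {"010", "011"}, {"111"}]
final_S, count = run_replacements(universe, 2, events)
```

After the four events the lexicographically first surviving string is `"100"`, and it lies in the final $S$; the set was replaced twice. Claim C.3 bounds the number of such replacements, which is what makes the final $S$ cheap to describe.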

Claim C.3: The contents of the set $S$ is replaced at most
$2^{i+1}$ times.

Proof: There are two types of replacements that will
be treated separately.

Case 1: Replacement of the current contents of $S$ where
at some $j$-step with $j < i$ at least one element was removed
from the current contents $S \cap A$. Trivially, the number of
this type of replacements is bounded by the number of
$j$-steps with $j < i$, and hence by the number of programs of
length less than $i$, that is, by $2^i$.

Case 2: Replacement of the current contents of $S$ where
every one of the $2^{\lambda(i) - i}$ elements of the current contents of
$S$ is removed from $A$ by $j$-steps with $j \ge i$. Let us estimate
the number of this type of replacements: Every element $x$
removed at a $j$-step with $j \ge i$ belongs to a set $B$ with
$\Lambda(B) \le \lambda(j) \le \lambda(i)$. The overall cumulative number of
elements removed from $A$ on $j$-steps with $j \ge i$ is bounded
by $\sum_B 2^{\lambda(i) - K(B)} \le 2^{\lambda(i)}$, where the inequality follows by
(II.2). Hence replacements of the second type can happen
at most $2^{\lambda(i) - (\lambda(i) - i)} = 2^i$ times. ■

By Claim C.3, $S$ stabilizes after a certain number of
$j$-steps. That number may be large. However, the number of
replacements of $S$ is small. The final set $S \ni x$ has cardinality
$2^{\lambda(i) - i}$, and can be specified by the number of replacements
resulting in its current contents (as in Claim C.3),
and by $i, n, \lambda$. This shows that $K(S) \le i + K(i, n, \lambda) + O(1)$.

■

Proof: Theorem IV.8. The statement of the theorem
easily follows from the following two inequalities that are
valid for every $x$ (where $n = |x|$ and $k = K(x)$):

$\beta_x(i) + k \le \lambda_x(i) + O(1)$, for every $i \le k$; and   (C.2)
$\lambda_x(i + O(\log n)) \le \beta_x(i) + k + O(\log n)$,   (C.3)
for every $i$ satisfying $K(n) + O(1) \le i \le k$.

It is convenient to rewrite the formula defining $\delta(x \mid A)$
using the symmetry of information (II.3) as follows:

$\delta(x \mid A) = \log |A| + K(A) - K(A \mid x^*) - k + O(1)$   (C.4)
$= \Lambda(A) - K(A \mid x^*) - k + O(1)$.

Ad (C.2): This is easy, because for every set $S \ni x$
witnessing $\lambda_x(i)$ we have
$\delta(x \mid S) \le \Lambda(S) - k + O(1) = \lambda_x(i) - k + O(1)$
and $\beta_x(i) \le \delta(x \mid S)$.

Ad (C.3): This is more difficult. By (C.4), and the obvious
$K(A \mid x^*) \le K(A \mid x) + O(1)$, it suffices to prove that
for every $A \ni x$ there is an $S \ni x$ with

$K(S) \le K(A) + O(\log m)$,

$\log |S| \le \log |A| - K(A \mid x) + O(\log m)$,

where $m = \Lambda(A)$. Indeed, for every $A$ witnessing $\beta_x(i)$ the
set $S$ will witness $\lambda_x(i + O(\log n)) \le \beta_x(i) + k + O(\log n)$
(note that $m = \log |A| + K(A) = K(x \mid A^*) + \beta_x(i) + K(A) \le 3n + O(\log n)$
provided $i \ge K(n) + O(1)$). The
above assertion is only a little bit easier to prove than the
one in Lemma C.4 below, which also suffices. Since we need
this lemma in any case in the proof of Theorem IV.11, we
state and prove it right now.

Lemma C.4: For every $A \ni x$ there is $S \ni x$ with
$K(S) \le K(A) - K(A \mid x) + O(\log m)$ and
$\lceil \log |S| \rceil = \lceil \log |A| \rceil$ (where $m = \Lambda(A)$).

Proof: Fix some $A_0 \ni x$ and let $m = \Lambda(A_0)$. Our
task is the following: Given $K(A_0)$, $\lceil \log |A_0| \rceil$, $K(A_0 \mid x)$,
to enumerate a family of at most $2^{K(A_0) - K(A_0 \mid x) + O(\log m)}$
different sets $S$ with $\lceil \log |S| \rceil = \lceil \log |A_0| \rceil$ that cover all $y$'s
covered by sets $A$ with $K(A) = K(A_0)$, $K(A \mid y) = K(A_0 \mid x)$
and $\lceil \log |A| \rceil = \lceil \log |A_0| \rceil$. Since the complexity
of each enumerated $S$ does not exceed
$K(A_0) - K(A_0 \mid x) + O(\log m) + K(K(A_0), \lceil \log |A_0| \rceil, K(A_0 \mid x)) + O(1)
= K(A_0) - K(A_0 \mid x) + O(\log m)$, the lemma will be proved.
The proof is by running the following:

Algorithm: Given $K(A_0)$, $\lceil \log |A_0| \rceil$, $K(A_0 \mid x)$ we run
all programs dovetail style. We maintain auxiliary set
variables $C, U, D$, all of them initially $\emptyset$. Every time a new
program $p$ of length $K(A_0)$ in the dovetailing process halts,
with as output a set $A$ with $\lceil \log |A| \rceil = \lceil \log |A_0| \rceil$, we
execute the following steps:

Step 1: Update $U := U \cup A$.

Step 2: Update $D := \{y \in U \setminus C : y$ is covered by at
least $t = 2^{K(A_0 \mid x) - \delta}$ different generated $A$'s$\}$, where
$\delta = O(\log m)$ will be defined later.

Step 3: This step is executed only if there is $y \in D$ that is
covered by at least $2t$ different generated $A$'s. Enumerate
as many new disjoint sets $S$ as are needed to cover $D$: we
just chop $D$ into parts of size $2^{\lceil \log |A_0| \rceil}$ (the last part may
be incomplete) and name those parts the new sets $S$. Every
time a new set $S$ is enumerated, update $C := C \cup S$.

Claim C.5: The string $x$ is an element of some enumerated
$S$, and the number of enumerated $S$'s is at most
$2^{K(A_0) - K(A_0 \mid x) + O(\log m)}$.

Proof: By way of contradiction, assume that $x$ is
not an element of the enumerated $S$'s. Then there are
fewer than $2^{K(A_0 \mid x) - \delta + 1}$ different generated sets $A$ such that
$x \in A$. Every such $A$ therefore satisfies
$K(A \mid x) \le K(A_0 \mid x) - \delta + O(\log m) < K(A_0 \mid x)$ if $\delta$ is chosen appropriately.
Since $A_0$ was certainly generated this is a contradiction.

It remains to show that we enumerated at most
$2^{K(A_0) - K(A_0 \mid x) + O(\log m)}$ different $S$'s. Step 3 is executed
only once per $t$ executions of Step 1, and Step 1 is
executed at most $2^{K(A_0)}$ times. Therefore Step 3 is executed
at most $2^{K(A_0)}/t = 2^{K(A_0) - K(A_0 \mid x) + \delta}$ times. The
number of $S$'s formed from incomplete parts of $D$'s in
Step 3 is thus at most $2^{K(A_0) - K(A_0 \mid x) + \delta}$. Let us bound
the number of $S$'s formed from complete parts of $D$'s.
The total number of elements in different $A$'s generated is
at most $2^{K(A_0) + \lceil \log |A_0| \rceil}$ counting multiplicity. Therefore
the number of elements in their union having multiplicity
at least $2^{K(A_0 \mid x) - \delta}$ is at most
$2^{K(A_0) + \lceil \log |A_0| \rceil - K(A_0 \mid x) + \delta}$.
Every $S$ formed from a complete part of a set $D$ in
Step 3 accounts for $2^{\lceil \log |A_0| \rceil}$ of them. Hence the number
of $S$'s formed from complete parts of $D$'s is at most
$2^{K(A_0) - K(A_0 \mid x) + \delta}$. ■



Proof: Theorem IV.11. By Lemma C.4 there is
$S \ni x$ with $K(S) \le K(A) - K(A \mid x) + O(\log \Lambda(A))$ and
$\lceil \log |S| \rceil = \lceil \log |A| \rceil$.
Let us first upper bound $K(S)$. We have

$K(S) \le K(A) - K(A \mid x) = \delta(x \mid A) + k - \log |A|$
$= \beta_x(\alpha) + k - \log |A| + (\delta(x \mid A) - \beta_x(\alpha))$
$\le \lambda_x(\alpha) - \log |A| + (\delta(x \mid A) - \beta_x(\alpha))$

(all inequalities are valid up to an $O(\log \Lambda(A))$ additive term).
The obtained upper bound is obviously equivalent to the
first upper bound of $K(S)$ in the theorem. As $\log |S| = \log |A|$
it gives the upper bound of $\Lambda(S)$ from the theorem.
Finally, as $\lambda_x(\alpha) \le h_x(\alpha) + \alpha + O(1)$ we obtain
$K(S) \le \alpha + (h_x(\alpha) - \log |A|) + (\delta(x \mid A) - \beta_x(\alpha))$
(up to an $O(\log \Lambda(A))$ additive term). ■
Proof: Theorem VIII.2. We first show that
$|m^l_x| \le K(x) + O(\log l)$ for every $x$ with $K(x) \le l$. Indeed, given
$x$, $l$, $|m^l_x|$ and the $|N^l| - |m^l_x|$ least significant bits of $N^l$ we
can find $N^l$: find $I^l_x$ by enumerating $D$ until a pair $\langle x, i \rangle$
with $i \le l$ appears and then complete $m^l_x$ by using the
$|m^l_x|$ most significant bits of the binary representation of
$I^l_x$. Given $l$ and $N^l$ we can find, using a constant-length
program, the lexicographically first string not among the $N^l$
strings of complexity at most $l$. By
construction, this string has complexity at least $l + 1$. Then,
$l \le K(N^l) + O(\log l) \le K(x) + |N^l| - |m^l_x| + O(\log l)
\le K(x) + l - |m^l_x| + O(\log l)$ (use $|N^l| \le l + O(1)$). Thus,
$|m^l_x| \le K(x) + O(\log l)$.

Let $x$ be the string of complexity at most $i$ with maximum
$I^l_x$. Given $m^l_x$ and $i, l, |N^l|$ we can find all strings of
complexity at most $i$ by enumerating $D$ until $N$ pairs $\langle y, j \rangle$
with $j \le l$ appear, where $N$ is the number whose binary
representation has prefix $m^l_x 1$ and then $(|N^l| - |m^l_x| - 1)$
zeros. Since $|m^l_x| \le i + O(\log l)$, this proves
$K(N^i \mid N^l_i) = O(\log l)$. Since
$K(N^i) \ge i - O(\log i) \ge K(N^l_i) - O(\log i)$
we have $K(N^l_i \mid N^i) = O(\log l)$. ■
Proof: Theorem VIII.4. (i) If the $(i+1)$st most significant
bit of $N^l$ is "1," then all the numbers with binary
representation of the form $N^l_i 0 {*}{*}\cdots{*}$ are used as indexes
of some $y$ with $K(y) \le l$, that is, $S^l_i$ has exactly $2^{|N^l| - i - 1}$
elements. We can find $S^l_i$ given $i$, $|N^l|$ and $N^l_i$ by
enumerating all its elements. On the other hand, $N^l_i$ can be
found given $S^l_i$ and $i, l$ as the first $i$ bits of $I^l_x$ for every
$x \in S^l_i$.

(ii) Since $i = |m^l_x|$, the largest common prefix of the binary
representations of $I^l_x$ and $N^l$ has the form $N^l_i 0 {*}{*}\cdots{*}$ and
the $(i+1)$st most significant bit of $N^l$ is 1. In particular,
$x \in S^l_i$.

Let $J = \max\{I^l_y \mid y \in S\}$. As $x \in S$, we have $J \ge I^l_x$. We
can find $N^l_i$ given $i$, $l$ and $S$ by finding $J$ and taking the $i$
first bits of $J$. Given $N^l_i$ we can find $S^l_i$. Hence
$K(S^l_i \mid S) = O(\log l)$. Therefore $K(S^l_i) \le K(S) + O(\log l)$. By Item (i)
and by the previous theorem we have $K(S^l_i) = i + O(\log l)$.
Again by Item (i) we have
$\Lambda(S^l_i) \le l + O(\log l) = \Lambda(S) + O(\log l)$.

(iii) Let $i = |m^l_x|$. We distinguish two cases.

Case 1: $i \ge \alpha$. Then
$K(N^\alpha \mid S) \le K(N^\alpha \mid S^l_i) + O(\log l) \le K(N^\alpha \mid N^i) + O(\log l) = O(\log l)$.
And $K(S \mid N^\alpha) = K(S) - K(N^\alpha) + O(\log l) = K(S) - \alpha + O(\log l)$.

Case 2: $i < \alpha$. Let $A = S^l_i$. As $\Lambda(S^l_i) \le \Lambda(S) + O(\log l)$
we need to prove that $K(S^l_i) \le \alpha - K(N^\alpha \mid S)$ and
$K(S^l_i) \le K(S) - K(S \mid N^\alpha)$ up to an $O(\log l)$ additive term. We have

$K(S^l_i) = K(N^\alpha) - K(N^\alpha \mid S^l_i) + O(\log l)$
$\le \alpha - K(N^\alpha \mid S) + O(\log l)$

and

$K(S^l_i) = K(S) - K(S \mid S^l_i) + O(\log l)$
$\le K(S) - K(S \mid N^\alpha) + O(\log l)$.


IV. Computability Properties

A. Structure Function 

It is easy to see that $h_x(\alpha)$ or $\lambda_x(\alpha)$, and the finite set
that witnesses its value, are upper semi-computable: run
all programs of length up to $\alpha$ in dovetailed fashion, check
whether a halting program produced a finite set containing
$x$, and replace the previous candidate with the new set if
it is smaller.
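
The "replace the candidate if smaller" loop can be sketched as a generator of successively better approximations. Running and dovetailing actual programs is outside the scope of a short example, so the sketch assumes a hypothetical stream `candidates` standing in for the finite sets printed by programs of length at most $\alpha$, in the order in which they halt.

```python
def upper_semicompute(x, candidates):
    """Yield each new best (smallest so far) candidate set that contains x."""
    best = None
    for S in candidates:                      # S: output of one halting program
        if x in S and (best is None or len(S) < len(best)):
            best = S                          # strictly smaller witness found
            yield best

# Illustrative stream of sets, as if produced by halting programs.
stream = [{"00", "01", "10", "11"}, {"10", "11"}, {"00", "01"}, {"01"}]
approximations = list(upper_semicompute("01", stream))
```

The sizes of the yielded sets only decrease, so their logarithms approximate $h_x(\alpha)$ from above. Nothing in the loop signals that the last candidate seen is final, which is why this gives upper semi-computability and not computability.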

The next question is: Is the function $\lambda_x(\alpha)$, as a function
of two arguments, computable? Of course not, because
if this were the case, then we could find, given every large
$k$, a string of complexity at least $k$. Indeed, we know that
there is a string $x$ for which $\lambda_x(k) \ge k$. Applying the
algorithm to all strings in lexicographical order, find the
first such $x$. Obviously $K(x) \ge k - O(1)$. But it is known
that we cannot prove that $K(x) > k$ for sufficiently large

k, m]. 

Assume now that we are also given $K(x)$. The above
argument does not work any more but the statement remains
true: $\lambda_x(\alpha)$ is not computable even if the algorithm
is given $K(x)$.

Assume first that the algorithm is required to output the
correct answer given any approximation to $K(x)$. We show
that no algorithm can find $\lambda$ that is close to $\lambda_x(\alpha)$ for some
$0 \le \alpha \le K(x)$.

Theorem D.1: For every constant $c$ there is a constant
$d$ such that the following holds. There is no algorithm that
for infinitely many $k$, given $k$ and $x$ of length $k + d \log k$
with $|K(x) - k| \le 2 \log k$, always finds $\lambda$ such that there is
$2 \log k \le \alpha \le k$ with $|\lambda_x(\alpha) - \lambda| \le c \log k$.

Proof: Fix $c$. The value of $d$ will be chosen later. The
proof is by contradiction. Let $A$ be some algorithm. We
want to fool it on some pair $\langle x, k \rangle$.

Fix large $k$. We will construct a set $S$ of cardinality
$2^{k - 2 \log k}$ such that every string $x$ in $S$ has length $k + d \log k$
and complexity at most $k + 2 \log k$, and the algorithm halts
on $\langle x, k \rangle$ and outputs $\lambda \ge k + (c + 1) \log k$. This is a
contradiction. Indeed, there is $x \in S$ with $K(x) \ge k - 2 \log k$.
Hence the output $\lambda$ of $A$ on $\langle x, k \rangle$ is correct, that is, there
is $\alpha$ with $2 \log k \le \alpha \le k$ and $|\lambda_x(\alpha) - \lambda| \le c \log k$. Then
$\lambda_x(\alpha) \ge k + \log k$. On the other hand, $\lambda_x(2 \log k) \le k$ as
witnessed by $S$. Thus we obtain

$k < \lambda_x(\alpha) \le \lambda_x(2 \log k) \le k$,

a contradiction.

Run in a dovetailed fashion all programs of length $k$ or
less. Start with $x$ equal to the first string of length
$k + d \log k$ and with $S = B = \emptyset$. Run $A$ on $\langle x, k \rangle$ and include
in $B$ all strings $x'$ such that either a program $p$ of length
at most $k$ has halted and output a set $C \ni x'$ with
$|p| + \log |C| < k + (2c + 1) \log k$, or we find out that
$K(x') < k - 2 \log k$. Once $x$ gets in $B$ we change $x$ to the first string
of length $k + d \log k$ outside $B \cup S$. (We will show that at
every step it holds that $|B \cup S| < 2^{k + d \log k}$.)

We proceed in this way until $A(x, k)$ prints a number $\lambda$
or the number of changes of $x$ exceeds $2^{k+1}$. (Actually, we
will prove that the number of changes of $x$ does not exceed
$2^{k+1} + 2^{k - 2 \log k}$.) Therefore $K(x) \le k + 2 \log k$ for all our
$x$'s, so we eventually will find $x$ such that $A(x, k)$ outputs
a result $\lambda$. If $\lambda \ge k + (c + 1) \log k$ then include $x$ in $S$
and then change $x$ to the first string of length $k + d \log k$
outside (the current version of) $B \cup S$. Otherwise, when
$\lambda < k + (c + 1) \log k$, let $\tilde\lambda_x$ be the current approximation
of $\lambda_x$. We know that $x$ is outside all known sets $C$ with
$K(C) \le k$, $K(C) + \log |C| < k + (2c + 1) \log k$. Therefore, for
every $\alpha \le k$ it holds that $\tilde\lambda_x(\alpha) \ge k + (2c + 1) \log k$ and hence
$|\tilde\lambda_x(\alpha) - \lambda| > c \log k$. This implies that either
$K(x) < k - 2 \log k$ or $\tilde\lambda_x$ differs from $\lambda_x$. So we are sure that at
least one more program of length $k$ or less still has to halt.
We wait until this happens, then include $x$ in $B$ and change
$x$ to the first string of length $k + d \log k$ outside $B \cup S$.

Once we get $2^{k - 2 \log k}$ elements in $S$ we halt. Every
change of $x$ is caused by a halting of a new program of
length at most $k$ or by including $x$ in $S$, thus the total
number of changes does not exceed $2^{k+1} + 2^{k - 2 \log k}$.
Note that at every step we have $|B \cup S| < 2^{k + d \log k}$,
provided that $d > 2c + 1$. ■
What if the algorithm is required to approximate $\lambda_x$ only
if it is given the precise value of $K(x)$? We are able to prove
that in this case the algorithm cannot compute $\lambda_x(\alpha)$ either.
It is even impossible to approximate the complexity of the
minimal sufficient statistic. To formulate this result precisely
consider the following promise problem:
Input: $x$, $k = K(x)$, $\alpha \in [\epsilon, k - \epsilon]$.
Output:

1, if $\lambda_x(\alpha - \epsilon) < k + 6 \log k$,
0, if $\lambda_x(\alpha + \epsilon) > k + 3\epsilon$.

If neither of the two cases above occurs the algorithm may
output any value or no value at all.

Theorem D.2: There is no algorithm $A$ solving this
promise problem for all $x$ and $\epsilon = |x|/(10 \log |x|)$.

Corollary D.3: There is no algorithm that given $x$, $k = K(x)$
finds an integer valued function $\lambda$ on $[0, k]$ such that
$\lambda_x = E(\lambda)$ for $\epsilon = \delta = |x|/(10 \log |x|)$.

Indeed, if there were such an algorithm we could solve the
above promise problem by answering 1 when $\lambda(\alpha) \le k + 2\epsilon$
and 0 otherwise.

Proof: The proof is by contradiction. The idea is
as follows. Fix large $k$. We consider $N = O(\log k)$ points
$\alpha_1, \ldots, \alpha_N$ that divide the segment $[0, k]$ into equal parts.
We lower semicompute $\lambda_x$ and $K(x)$ for different $x$'s of
length about $k + 4\epsilon$. We are interested in strings $x$ with
$\tilde\lambda_x(\alpha_1 + \epsilon) \ge k + 3\epsilon$ where $\tilde\lambda_x$ is the current
approximation to $\lambda_x$. By counting arguments there are many
such strings. We apply the algorithm to $\langle x, \tilde K(x), \alpha_1 \rangle$ for
those $x$'s, where $\tilde K(x)$ stands for the currently known upper
bound for $K(x)$. Assume that $A(x, \tilde K(x), \alpha_1)$ halts.
If the answer is 1 then we know that $K(x) < \tilde K(x)$ or
$\lambda_x(\alpha_1 + \epsilon) < \tilde\lambda_x(\alpha_1 + \epsilon)$ and we continue lower
semicomputation until we get to know which of the two values $\tilde K(x)$ or
$\tilde\lambda_x(\alpha_1 + \epsilon)$ gets smaller. If the latter is decreased we just
remove $x$ (the total number of removed $x$ will not exceed
$2^{k + 3\epsilon}$ and thus they form a small fraction of strings of
length $k + 4\epsilon$). If for many $x$'s the answer is 0 we make
those answers incorrect by including those $x$'s in a set of
cardinality $2^{k - \alpha_1 + 2\epsilon}$ and complexity $\alpha_1 - \epsilon$. Then for all
such $x$'s $\lambda_x(\alpha_1 - \epsilon) \le k + \epsilon$ and thus the algorithm's answer
is incorrect. Hence $K(x) < \tilde K(x)$ and we continue lower
semicomputation. For all those $x$'s for which $\tilde K(x)$ is
decreased we repeat the trick with $\alpha_2$ in place of $\alpha_1$. In this
way we will force $\tilde K(x)$ to decrease very fast for many $x$'s.
For most $x$'s $\tilde K(x)$ will become much less than $k$, which
is impossible.

Here is the detailed construction. Fix large $k$. Let
$N = 3 \log k$, $\delta = k/(9 \log k)$ (one third of the distance between
consecutive $\alpha_i$), $\alpha_i = k - 3\delta i + \delta$, $n = k + 4\delta$ (the length of
$x$). The value of the parameter $\epsilon$ is chosen to be slightly less
than $\delta$ (we will need that $\delta \ge \epsilon + 4 \log k$ for large enough
$k$).
We will run all the programs of length at most
$k' = k + 2 \log k$ and the algorithm $A$ on all possible inputs in a
dovetailed fashion.

We will define a set X of 2*' strings of length n. Our 
action will be determined by k only, hence K{x) < k + 
2 log A; = k' for every x G X provided k is large enough. 
We will also define some small sets Bi for I = 1, . . . , N, the 
sets of "bad" strings and B will denote their union. Every 
Bi will have at most 2''"* elements. We start with Bi = ^ 
iovl = l,...,N. 

We make $2^k$ stages. At every stage consider the sets

$X_l = \{x \in X \setminus B : \tilde K(x) = k' + 1 - l\}$ for $l = 1, \ldots, N$,
$X_0 = \{x \in X \setminus B : \tilde K(x) > k'\}$.

Before and after every stage the following invariant will be
true.

(1) $|X_l| < 2^{3l\delta}$ for every $0 \le l \le N$; in particular $X_0 = \emptyset$.

(2) For all $1 \le l \le N$ and all $x \in X_l$ it holds that
$A(x, \tilde K(x), \alpha_l) = 0$.

(3) For all $0 \le i < l \le N$ and all $x \in X_i$ it holds that
$\tilde\lambda_x(\alpha_l + \delta) > k + 3\delta$.

(4) $|B_l| \le 2^{3l\delta - 3\delta + 1} \times$ (the number of programs of length
at most $k - 3l\delta + 2\delta$ that have halted so far).

At the start all $X_l$'s and $B_l$'s are empty so the invariant
is true. Each stage starts by including a new element in
$X$. This element is the first string $x_0$ of length $n = k + 4\delta$
outside $X$ such that $\tilde\lambda_{x_0}(\alpha) > k + 3\delta$ for all $\alpha < k$. Thus by
the choice of $x_0$ the assertions (3) and (4) remain true but
(1) and (2) may not.

We claim that by continuing the dovetailing and updating
the $B_l$'s properly we eventually make every one of (1), (2), (3)
and (4) true. During the dovetailing the sets $X_l$ change
(an element can move from $X_l$ to $X_i$ for $i > l$ and even to
$X \setminus (X_0 \cup \cdots \cup X_N)$). We will denote by $\hat X_l$ the version
of $X_l$ at the beginning of the stage (and $\hat X_0 = \{x_0\}$) and
keep the notations $X_l, B_l$ for the current versions of $X_l, B_l$,
respectively. The rule to update the $B_l$'s is very simple: once
at some step of the dovetailing a new set $C$ of complexity
at most $k - 3l\delta + 2\delta = \alpha_l + \delta$ appears, we include in $B_l$
all the elements of the set $\bigcup_{j=0}^{l-1} \hat X_j$. As $X_i \subseteq \bigcup_{j=0}^{i} \hat X_j$, this
keeps (3) true. Moreover, this also keeps true the following
assertion:

(5) For all $1 \le l \le N$ and all $x \in X_l \setminus \hat X_l$ it holds that
$\tilde\lambda_x(\alpha_l + \delta) > k + 3\delta$.

And this also keeps (4) true since $|\bigcup_{j=0}^{l-1} \hat X_j| < 1 + \sum_{j=1}^{l-1} 2^{3j\delta} \le 2^{3l\delta - 3\delta + 1}$.

We continue the dovetailing and update the $B_l$'s as described
until both (1) and (2) are true. Let us prove that this
happens eventually. It suffices to show that if (3), (4) and
(5) are true but (2) is not, or (2), (3), (4) and (5) are true
but (1) is not, then at least one program of length $\le k'$ will
halt or $A(x, \tilde K(x), \alpha_l)$ is undefined for some $l$ and some
$x \in X_l$.

Consider the second case: (2), (3), (4) and (5) are true
but (1) is not. Pick $l$ such that $|X_l| \ge 2^{3l\delta}$. If $l = 0$, that
is, $\tilde K(x_0) > k'$, we are done, as $K(x_0) \le k'$. Otherwise, let
$S$ consist of the first $2^{3l\delta}$ elements in $X_l$. We claim that
$K(S) \le k - 3l\delta + 4 \log k < \alpha_l - \varepsilon$. To prove the claim we
will show that all $S \subset X_l$ obtained in this way are pairwise
disjoint, therefore their number is at most $2^k / 2^{3l\delta}$. Thus
$S$ may be identified by $k$, $l$ and its index among all such
$S \subset X_l$.

Therefore for all $x \in S$ we have $\lambda_x(\alpha_l - \varepsilon) \le k + 4 \log k <
\tilde K(x) + 6 \log \tilde K(x)$ and the value $A(x, \tilde K(x), \alpha_l) = 0$ is not
correct. This implies that $\tilde K(x)$ is not correct for all $x \in S$.
We continue the dovetailing until all elements of $S$ move
outside $X_l$. Then $S$ becomes disjoint from $X_0 \cup \cdots \cup X_l$
and therefore it will be disjoint from all future versions of
$X_l$.

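Consider the first case: (3), (4) and (5) are true but
(2) is not. Pick $l$ and $x \in X_l$ such that $A(x, \tilde K(x), \alpha_l)$
is undefined or $A(x, \tilde K(x), \alpha_l) = 1$. If $A(x, \tilde K(x), \alpha_l)$ is
undefined then we are done: since $\tilde\lambda_x(\alpha_l + \varepsilon) \ge \tilde\lambda_x(\alpha_l +
\delta) > k + 3\delta > \tilde K(x) + 3\varepsilon$, either $\tilde\lambda_x(\alpha_l + \varepsilon)$ or $\tilde K(x)$ will decrease,
or $A(x, \tilde K(x), \alpha_l)$ will get defined. Consider the other case.
Obviously $x \in X_l \setminus \hat X_l$. By (5) we have $\tilde\lambda_x(\alpha_l + \varepsilon) \ge \tilde\lambda_x(\alpha_l +
\delta) > k + 3\delta > \tilde K(x) + 3\varepsilon$. Therefore $\lambda_x(\alpha_l + \delta) < \tilde\lambda_x(\alpha_l + \delta)$
or $K(x) < \tilde K(x)$ and we are done.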

After $2^k$ stages the set $X$ has $2^k$ elements and we have
a contradiction. Indeed, all of $X_1, \ldots, X_N$ form a very small
part of $X$ because of (1). The sets $B_1, \ldots, B_N$ together
also form a very small part of $X$ because of (4). Thus for
most strings $x \in X$ it holds that $K(x) < k' - N + 1 < k$, which
is a contradiction. ■

Remark D.4: Let us replace in the above promise problem
$K(x)$, the prefix complexity of $x$, by $C(x)$, the
plain complexity of $x$. For the modified problem we can
strengthen the above theorem by allowing $\varepsilon = |x|/c$ where
the constant $c$ depends on the reference computer. Indeed
for every $x \in X$ we have $C(x) \le k + O(1)$: every $x \in X$
can be described by its index in $X$ in exactly $k$ bits and
the value of $k$ may be retrieved from the length of the
description of $x$. Therefore we will need $N = O(1)$ to obtain
a contradiction. ◊

After a discussion of these results, Andrei A. Muchnik
suggested, and proved, that if we are also given an $\alpha_0$ such
that $\lambda_x(\alpha_0) \approx K(x)$ but $\lambda_x(\alpha)$ is much bigger than $K(x)$
for $\alpha$ much less than $\alpha_0$ (which is therefore the complexity
of the minimal sufficient statistic), then we can compute
$\lambda_x$ over all of its domain. This result underlines the
significance of the information contained in the minimal sufficient
statistic:

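Theorem D.5: There are a constant $c > 0$ and an algorithm
that given any $x, k, \alpha_0$ with $K(x) \le k \le \lambda_x(\alpha_0)$ finds
a non-increasing function $\lambda$ defined on $[0, k]$ such that $\lambda_x =
\mathcal{E}(\lambda)$ with $\delta = \lambda_x(\alpha_0) - K(x) + O(1)$ and $\varepsilon = \alpha_0 - \alpha_1 + c \log k$
where $\alpha_1 = \min\{\alpha : \lambda_x(\alpha) \le k + c \log k\}$.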

Proof: The algorithm is as follows. Let $D_k = \{(y, i) \mid
K(y) \le i \le k\} \subset D$. Enumerate pairs $(y, i) \in D_k$ until a
pair $(x, i_0)$ appears and form a list of all enumerated pairs.
For $\alpha < \alpha_0$ define $\lambda(\alpha)$ to be the minimum $i + \log |S|$ over
all $S \ni x$ such that a pair $(S, i)$ with $i \le \alpha$ is in the list.
For $\alpha_0 \le \alpha \le k$ let $\lambda(\alpha) = k$.
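The definition of $\lambda$ from the enumerated list can be sketched as follows. This is an illustration under stated assumptions, not the algorithm itself: the list is given as an explicit toy table instead of being enumerated from $D_k$, the values attached to the sets are made up, base-2 logarithms are used, and returning infinity when no listed set qualifies is a choice of this sketch. The names `lambda_from_list`, `listed_sets` and the sample data are hypothetical.

```python
import math

def lambda_from_list(x, listed_sets, alpha0, k):
    """Return the function alpha -> lambda(alpha) built from a finite list.

    listed_sets holds pairs (S, i), with S a frozenset of strings and i the
    complexity bound under which S was enumerated.
    """
    def lam(alpha):
        if alpha >= alpha0:
            return k                      # lambda(alpha) = k on [alpha0, k]
        costs = [i + math.log2(len(S))
                 for S, i in listed_sets
                 if x in S and i <= alpha]
        return min(costs) if costs else math.inf
    return lam

# Toy data: all 4-bit strings listed at level 1, a two-element set at level 3.
all_four_bit = frozenset(format(v, "04b") for v in range(16))
pair = frozenset({"0110", "0111"})
lam = lambda_from_list("0110", [(all_four_bit, 1), (pair, 3)], alpha0=3, k=4)
```

With this toy input, `lam(1)` and `lam(2)` equal 5.0 (from the 16-element set) and `lam(3)` equals 4, so the resulting function is non-increasing on the levels where it is finite.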

For every $\alpha \ge \alpha_0$ we have $\lambda_x(\alpha) \ge K(x) - O(1) \ge
k - \lambda_x(\alpha_0) + K(x) - O(1) = \lambda(\alpha) - \delta$ and $\lambda_x(\alpha) \le \lambda_x(\alpha_0) \le
k + (\lambda_x(\alpha_0) - K(x)) \le \lambda(\alpha) + \delta$.

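For every $\alpha < \alpha_0$ we have $\lambda_x(\alpha) \le \lambda(\alpha)$. So it remains
to show that for every $\varepsilon \le \alpha < \alpha_0$ we have $\lambda(\alpha) \le \lambda_x(\alpha -
\varepsilon) + \delta$. We will prove a stronger statement: $\lambda(\alpha) = \lambda_x(\alpha)$
for every $\alpha \le \alpha_0 - \varepsilon$ provided $c$ is chosen appropriately.
To prove this it suffices to show that for all $S$ with
$K(S) \le \alpha_0 - \varepsilon$ the pair $(S, K(S))$ belongs to the list.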

By Theorem VIII.4, Item (i), we have $\lambda_x(|m_x^k| + c_1 \log k) \le
k + c_2 \log k$. That is, $\alpha_1 \le |m_x^k| + c_1 \log k$ if $c \ge c_2$, and
$\alpha_0 - \varepsilon = \alpha_0 - \alpha_0 + \alpha_1 - c \log k \le |m_x^k| + (c_1 - c) \log k$.

From the proof of Theorem VIII.2 we see that there is
a constant $c_3$ such that for every $y$ with $K(y) \le |m_x^k| -
c_3 \log k$ the index of $(y, K(y))$ in the enumeration of $D_k$
has less than $|m_x^k|$ common bits with $N^k$. Assuming that
$c \ge c_1 + c_3$ we obtain that the indexes of all pairs $(y, K(y))$
with $K(y) \le \alpha_0 - \varepsilon$ in the enumeration of $D_k$ are less than
$I_x^k$. ■

B. Randomness Deficiency Function 

The function $\beta_x(\alpha)$ is computable from $x, \alpha$ given an
oracle for the halting problem: run all programs of length
$\le \alpha$ in dovetailed fashion and find all finite sets $S$ containing $x$
that are produced. With respect to all these sets determine
the conditional complexity $K(x \mid S^*)$ and hence the
randomness deficiency $\delta(x \mid S)$. Taking the minimum we find
$\beta_x(\alpha)$. All these things are possible using information from
the halting problem to determine whether a given program
will terminate or not. It is also the case that the function
$\beta_x(\alpha)$ is upper semi-computable from $x, \alpha, K(x)$ up to a
logarithmic error: this follows from the semi-computability
of $\lambda_x(\alpha)$ and Theorem IV.8. More subtle is that $\beta_x$ is not
semi-computable, not even within a large margin of error, as the following theorem states.
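The oracle-based computation of $\beta_x(\alpha)$ described at the start of this subsection reduces to taking a minimum once the oracle has supplied the candidate sets and the conditional complexities. The sketch below shows only that final step. Everything the halting-problem oracle would provide is replaced by a hypothetical lookup table with made-up values; the names `beta`, `models` and `cond_complexity_table` are illustrative, and base-2 logarithms are assumed.

```python
import math

def beta(x, alpha, models, cond_complexity):
    """Minimum randomness deficiency of x over listed models of complexity <= alpha.

    models: pairs (S, complexity of S), standing in for the sets produced by
        programs of length at most alpha.
    cond_complexity(x, S): stand-in for the oracle-supplied value K(x | S).
    """
    deficiencies = [math.log2(len(S)) - cond_complexity(x, S)
                    for S, complexity in models
                    if complexity <= alpha and x in S]
    return min(deficiencies) if deficiencies else math.inf

# Toy data with assumed (not computed) complexities.
all_four_bit = frozenset(format(v, "04b") for v in range(16))
pair = frozenset({"0110", "0111"})
models = [(all_four_bit, 1), (pair, 3)]
cond_complexity_table = {all_four_bit: 2, pair: 1}

def cond_complexity(x, S):
    return cond_complexity_table[S]

beta_at_1 = beta("0110", 1, models, cond_complexity)
beta_at_3 = beta("0110", 3, models, cond_complexity)
```

In the toy table the string has deficiency 2 in the large set and 0 in the two-element set, so allowing more model complexity lowers the minimum deficiency from 2.0 to 0.0.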

Theorem D.6: The function $\beta_x(\alpha)$ is

(i) not lower semi-computable to within precision $|x|/3$;
and

(ii) not upper semi-computable to within precision
$|x|/\log^4 |x|$.

Proof: (i) The proof is by contradiction. Assume
Item (i) is false. Choose an arbitrary length $n$. Let $\beta$
be a function defined by (3{i) = § for < i < ^, and 
equal otherwise. Then the function Px with x of length 
n, corresponding to /3, by Corollary IIV. 91 has x with k — 
K{x) satisfying /3(0) = n — k ± 0{logn) so that k ~ ± 
O(logn). Moreover, /3x{i) = f ± O(logn) for O(logn) < 
i < f-O(logn), and (3x{i) = O(logn) fori > f -fO(logn). 
Write the set of such x's as X. By dovetailing the lower 
approximation of (3x (i) for all x of length n and some i with 
f ^ * ^ J 5 by assumption on lower semi-computability of 
f3x, we must eventually find an x, if not x ^ X then x G 
X, for which the lower semi-computation of (3x{i) exceeds 
I — ^ — O(logn). But we know from Corollarv lIV.91 that 
$\beta_x(i) = O(\log n)$ for $i \ge K(x) + O(\log n)$, and hence we
have determined that $i - O(\log n) < K(x)$. Therefore,
K{x) > f — O(logn). But this contradicts the well-known 
fact that there is no algorithm that for any given n finds 
a string of complexity at least /(n) where / is a computable 
total unbounded function. 



(ii) The proof is by contradiction. Assume Item (ii) is
false. Fix a large length $n = 2^k$ and let $A_1 = \{0, 1\}^n$, so
that $\alpha = 2 \log k \ge K(A_1)$. Let $x$ be a string of length
$n$, let $N^\alpha < 2^{\alpha+1}$ be the number of halting programs of
length at most $\alpha$, and let $\mathcal{A} = \{A_1, \ldots, A_m\}$ be the set
of all finite sets of complexity at most $\alpha$. Since $x \in A_1$,
the value $\beta_x(\alpha)$ is finite and $\beta_x(\alpha) = \min_{A \in \mathcal{A}} \{\delta(x \mid A)\}$.
Assuming $\beta_x$ is upper semi-computable, we can run the
following algorithm:

Algorithm: Given $N^\alpha$, $\alpha$, and $x$,

Step 1: Enumerate all finite sets $\mathcal{A} = \{A_1, \ldots, A_m\}$ of
complexity $K(A_i) \le \alpha$. Since we are given $N^\alpha$ and $\alpha$ we can
list them exhaustively.

Step 2: Dovetail the following computations simultaneously:

Step 2.1: Upper semi-compute $\beta_x(\alpha)$, for all $x$ of length $n$.

Step 2.2: For all $i = 1, \ldots, m$, lower semi-compute $\delta(x \mid
A_i) = \log |A_i| - K(x \mid A_i)$.

We write the approximations at the $t$th step as $\beta_x^t(\alpha)$,
$\delta^t(x \mid A_i)$, and $K^t(x \mid A_i)$, respectively. We continue the
computation until step $t$ such that

$$\beta_x^t(\alpha) \le \min_{A \in \mathcal{A}} \{\delta^t(x \mid A)\} + n/\log^4 n.$$
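The stopping rule of Step 2 can be illustrated with explicit approximation sequences. This is a sketch under assumptions: in the proof the sequences come from the assumed semi-computations, whereas here they are hypothetical finite lists, and `run_until_close` and its arguments are illustrative names.

```python
def run_until_close(beta_upper, delta_lower, tolerance):
    """Advance all approximations in lockstep until the stopping rule fires.

    beta_upper: iterator of non-increasing upper bounds on beta_x(alpha).
    delta_lower: list of iterators, one per set A_i, each giving
        non-decreasing lower bounds on delta(x | A_i).
    tolerance: the allowed gap (n / log^4 n in the text).
    Returns (t, beta_t, index of the minimizing set, its delta_t).
    """
    for t, beta_t in enumerate(beta_upper, start=1):
        deltas_t = [next(it) for it in delta_lower]
        i_min = min(range(len(deltas_t)), key=deltas_t.__getitem__)
        if beta_t <= deltas_t[i_min] + tolerance:
            return t, beta_t, i_min, deltas_t[i_min]
    raise RuntimeError("upper bounds exhausted before the stopping rule fired")

# Hypothetical approximation sequences for two candidate sets.
result = run_until_close(
    beta_upper=iter([9, 7, 5, 4]),
    delta_lower=[iter([0, 1, 2, 3]), iter([2, 3, 5, 6])],
    tolerance=1,
)
```

In this toy run the rule fires at step 4, where the upper bound 4 is within the tolerance of the smallest lower bound 3, attained by the first set.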

This $t$ exists by the assumption above. By definition,
$\min_{A \in \mathcal{A}} \{\delta(x \mid A)\} = \beta_x(\alpha) \le \beta_x^t(\alpha)$. Let $A^t$ denote
the set minimizing the right-hand side. (Here we use
that $x$ belongs to a set in $\mathcal{A}$.) Together, this shows that
$\log |A^t| - \beta_x^t(\alpha) \le K(x \mid A^t)$ and $\log |A^t| - \beta_x^t(\alpha) \ge K^t(x \mid
A^t) - n/\log^4 n \ge K(x \mid A^t) - n/\log^4 n$. Thus we obtained
an estimation $\log |A^t| - \beta_x^t(\alpha)$ of $K(x \mid A^t)$ with precision
$n/\log^4 n$. We use that $K(x \mid A^t)$ is a good approximation
to $K(x)$:

$$K(x \mid A^t) - c_1 \le K(x) \le K(x \mid A^t) + K(A^t) + c_1 \le K(x \mid A^t) + \alpha + c_1,$$

where $c_1$ is a constant. Consequently,

$$K(K(x) \mid x) \le K(N^\alpha, \alpha, K(x) - \log |A^t| + \beta_x^t(\alpha)) + c_2,$$

where the constant $c_2$ is the length of a program to
reconstruct $\alpha$, $N^\alpha$ and $K(x) - \log |A^t| + \beta_x^t(\alpha) \le \alpha +
c_1 + n/\log^4 n$, and to combine this information with the
conditional information $x$ to compute $K(x)$. Observing
$K(N^\alpha) = \alpha - K(\alpha) + O(1)$ and substituting
$\alpha = 2 \log \log n$, there is a constant $c_3$ such that

$$K(K(x) \mid x) \le 2 \log \log n + \log n - 4 \log \log n + c_3.$$

However, for every $n$, we can choose an $x$ of length $n$ such
that $K(K(x) \mid x) \ge \log n - \log \log n$ by [H|, which gives the
required contradiction. ■
Open question. Is there a non-increasing (with respect
to $\alpha$) upper semi-computable function $f_x(\alpha)$ such that, for
all $x$, $\beta_x(\alpha) = \mathcal{E}(f_x(\alpha))$ for $\varepsilon = \delta = O(\log |x|)$ (or for
$\varepsilon = \delta = o(|x|)$)?